Harley-ml committed on
Commit 68c5e66 · verified · 1 Parent(s): 9164223

Update README.md

Files changed (1)
  1. README.md +255 -2
README.md CHANGED
@@ -1,6 +1,259 @@
  ---
  license: mit
  ---
 
- I'm on my lunch break. Yes, I take lunch breaks. This will be filled out soon.
- Sorry for the wait.
  ---
  license: mit
+ language:
+ - en
+ - es
+ tags:
+ - small
+ - tiny
+ - tinyword
+ - theword
+ - harley-ml
+ - small-language-model
+ - word-generation
+ - word-generator
+ - text-generation
+ - qwen3
  ---
 
+ # TinyWord-v2-128k
+
+ TinyWord-v2 is a redesigned and retrained version of TinyWord v1. v1 did not use weight tying, so its separate output projection consumed roughly half of its parameters. Its parameter count was therefore misleading: in effective capacity it was about the same size as MicroWord.
+ This version achieves much better performance than v1.
+
+ ## Architecture
+
+ | Parameter | Value |
+ |---|---|
+ | Hidden Layers | 2 |
+ | Hidden Size | 48 |
+ | Attention Heads | 1 |
+ | KV Heads | 1 |
+ | Vocab Size | 1,200 |
+ | Intermediate Size | 160 |
+ | RoPE Theta | 1,000 |
+ | Max Position Embeddings | 32 |
+ | Tie Word Embeddings | True |
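
As a rough cross-check of the table, the parameter count can be estimated from these values. The sketch below assumes the usual Qwen3 decoder layout suggested by the `qwen3` tag (bias-free attention and MLP projections, RMSNorm weights, and per-head query/key norms). `head_dim` is not listed in the table, so it is an assumed free parameter here, and the helper name `count_params` is made up for illustration.

```python
def count_params(
    hidden: int = 48,
    layers: int = 2,
    heads: int = 1,
    kv_heads: int = 1,
    vocab: int = 1200,
    intermediate: int = 160,
    head_dim: int = 64,          # ASSUMPTION: not given in the table
    tie_embeddings: bool = True,
) -> int:
    """Estimate parameters for a Qwen3-style decoder (assumed layout, no biases)."""
    # Attention projections: q_proj, then k_proj and v_proj, then o_proj.
    attn = hidden * heads * head_dim
    attn += 2 * hidden * kv_heads * head_dim
    attn += heads * head_dim * hidden
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden * intermediate
    # RMSNorm weights: two per layer over hidden, plus q/k norms over head_dim.
    norms = 2 * hidden + 2 * head_dim
    per_layer = attn + mlp + norms
    embed = vocab * hidden
    # An untied output head is a second vocab x hidden matrix.
    lm_head = 0 if tie_embeddings else vocab * hidden
    final_norm = hidden
    return layers * per_layer + embed + lm_head + final_norm


tied = count_params()
untied = count_params(tie_embeddings=False)
print(f"tied:   {tied:,}")
print(f"untied: {untied:,}")
print(f"saved by tying: {untied - tied:,}")
```

Under these assumptions, tying removes one vocab × hidden matrix (1,200 × 48 = 57,600 weights). With the assumed `head_dim` of 64 the tied estimate comes to 128,752; a different `head_dim` would change the total.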
+
+ ## Training
+
+ TinyWord-v2 was trained on 753,232 unique words (entries), totaling 3,225,398 tokens and 7,022,310 characters. About 660k of the words are English and about 90k are Spanish.
+
+ ### Dataset
+
+ | Key | Value |
+ | :---------------------: | :-------: |
+ | Entries (words) | 753,232 |
+ | Tokens | 3,225,398 |
+ | Characters | 7,022,310 |
+ | Avg. Tokens Per Entry | ~4.3 |
+ | Avg. Words Per Entry | 1 |
+ | Avg. Chars Per Entry | ~9.3 |
+ | Longest Entry (Tokens) | 36 |
+ | Shortest Entry (Tokens) | 1 |
+ | English Words | ~660k |
+ | Spanish Words | ~90k |
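
The per-entry averages follow directly from the three totals in the table. A minimal sketch of that arithmetic (the variable names are illustrative, and the characters-per-token ratio is an extra derived figure, not one reported above):

```python
entries = 753_232
tokens = 3_225_398
chars = 7_022_310

avg_tokens = tokens / entries      # tokens per word
avg_chars = chars / entries        # characters per word
chars_per_token = chars / tokens   # derived: how many characters one token covers

print(f"avg tokens per entry: {avg_tokens:.2f}")
print(f"avg chars per entry:  {avg_chars:.2f}")
print(f"chars per token:      {chars_per_token:.2f}")
```

This works out to roughly 4.28 tokens and 9.32 characters per entry.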
+
+ ### Hardware
+
+ TinyWord-v2 was trained on an NVIDIA RTX 2060 (6 GB) for 6 epochs with a batch size of 32.
+
+ ### Training Results
+
+ | Step | Train Loss | Val Loss | Train PPL | Val PPL |
+ |---|---|---|---|---|
+ | 2000 | 3.0579 | 2.5138 | 21.28 | 12.35 |
+ | 4000 | 2.0494 | 1.9456 | 7.76 | 6.99 |
+ | 6000 | 1.8572 | 1.7965 | 6.40 | 6.03 |
+ | 8000 | 1.7822 | 1.7294 | 5.94 | 5.64 |
+ | 10000 | 1.7360 | 1.6932 | 5.67 | 5.44 |
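
The perplexity columns appear to be the exponential of the corresponding loss, which is the usual relationship when the loss is mean cross-entropy in nats. The sketch below recomputes them from the table under that assumption; the helper name `perplexity` is illustrative.

```python
import math


def perplexity(loss: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


# (step, train loss, val loss) copied from the table above.
rows = [
    (2000, 3.0579, 2.5138),
    (4000, 2.0494, 1.9456),
    (6000, 1.8572, 1.7965),
    (8000, 1.7822, 1.7294),
    (10000, 1.7360, 1.6932),
]

for step, train_loss, val_loss in rows:
    print(
        f"{step:>6}  train PPL {perplexity(train_loss):.2f}"
        f"  val PPL {perplexity(val_loss):.2f}"
    )
```

The recomputed values agree with the table to within about 0.01, with the small differences presumably due to the losses being rounded to four decimals.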
+
+ ## Generations
+
+ Prompt: `w`
+
+ Output:
+ ```
+ wrtervulatoration
+ ```
+
+ Prompt: `app`
+
+ Output:
+ ```
+ appatating
+ ```
+
+ Prompt: `a`
+
+ Output:
+ ```
+ ay's
+ ```
+
+ Prompt: `z`
+
+ Output:
+ ```
+ aceae
+ ```
+
+ ## Limitations
+
+ 1. It does not generate sentences, prose, code, or anything other than a single word-like sequence.
+ 2. It cannot reason or produce complex language.
+ 3. Generated words may not be real. The goal is not to generate real words but to reflect the lexicon and morphology of English and Spanish with tiny language models.
+ 4. Output is non-deterministic when sampling is enabled. The same prompt can produce very different completions across runs.
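
The non-determinism in item 4 comes from sampling the next token at random from a filtered distribution. The toy sketch below illustrates the idea with temperature, top-k, and top-p filtering over a hypothetical five-token vocabulary, using only the standard library. It is a simplified illustration, not the `transformers` implementation, and `sample_token`, `draw`, and `toy_logits` are made-up names.

```python
import math
import random


def sample_token(logits, temperature=1.2, top_k=50, top_p=0.95, rng=random):
    """Sample one index from `logits` using temperature, top-k, then top-p."""
    # Temperature scaling: values above 1 flatten the distribution.
    scaled = [x / temperature for x in logits]
    # Softmax, shifted by the max for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # random.choices normalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.5, 1.0, 0.5, 0.0]  # hypothetical scores for 5 tokens


def draw(seed, n=8):
    """Draw n tokens from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [sample_token(toy_logits, rng=rng) for _ in range(n)]


# Unseeded draws can differ from run to run.
print([sample_token(toy_logits) for _ in range(8)])

# The same seed reproduces the same sequence.
print(draw(0) == draw(0))
```

With the real model, calling `torch.manual_seed(...)` before `model.generate(...)` is the usual way to make sampled output repeatable on the same hardware and library versions, and `do_sample=False` switches to greedy decoding.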
+
+ ## Inference
+
+ ```python
+ # =============================================================================
+ # Inference
+ # =============================================================================
+
+ MODEL_DIR = "Harley-ml/TinyWord2-128k"       # Hub repo ID or local model directory
+ TOKENIZER_PATH = "Harley-ml/TinyWord2-128k"  # Hub repo ID, local directory, or tokenizer.json file
+
+ # --- Generation settings ---
+ PROMPT = "w"  # prompt
+ MAX_NEW_TOKENS = 32
+ TEMPERATURE = 1.2
+ TOP_P = 0.95
+ TOP_K = 50
+ REPETITION_PENALTY = 1.1
+ DO_SAMPLE = True
+
+ # =============================================================================
+
+ import torch
+ from pathlib import Path
+ from transformers import (
+     AutoModelForCausalLM,
+     PreTrainedTokenizerFast,
+     AddedToken,
+ )
+
+ # ---------------------------------------------------------------------------
+ # Device
+ # ---------------------------------------------------------------------------
+
+ device = (
+     "cuda" if torch.cuda.is_available() else
+     "mps" if torch.backends.mps.is_available() else
+     "cpu"
+ )
+ print(f"Device : {device}")
+
+ # ---------------------------------------------------------------------------
+ # Tokenizer (mirrors training setup)
+ # ---------------------------------------------------------------------------
+
+ def load_tokenizer(path: str):
+     p = Path(path)
+     if p.is_file():
+         # A local tokenizer.json file
+         tok = PreTrainedTokenizerFast(tokenizer_file=str(p.resolve()))
+     else:
+         # Otherwise treat the value as a Hub repo ID or a local directory
+         tok = PreTrainedTokenizerFast.from_pretrained(path)
+     # Add any special tokens the tokenizer file does not define
+     specials = {}
+     if tok.bos_token is None:
+         specials["bos_token"] = AddedToken("<|bos|>", special=True)
+     if tok.eos_token is None:
+         specials["eos_token"] = AddedToken("<|eos|>", special=True)
+     if tok.unk_token is None:
+         specials["unk_token"] = AddedToken("<|unk|>", special=True)
+     if tok.pad_token is None:
+         if tok.eos_token is not None:
+             tok.pad_token = tok.eos_token
+         else:
+             specials["pad_token"] = AddedToken("<|pad|>", special=True)
+     if specials:
+         tok.add_special_tokens(specials)
+     tok.padding_side = "left"  # left-pad for batched generation
+     return tok
+
+ print("Loading tokenizer...")
+ tokenizer = load_tokenizer(TOKENIZER_PATH)
+ print(f" Vocab size : {tokenizer.vocab_size}")
+ print(f" BOS : {tokenizer.bos_token!r}")
+ print(f" EOS : {tokenizer.eos_token!r}")
+ print(f" PAD : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")
+
+ # ---------------------------------------------------------------------------
+ # Model
+ # ---------------------------------------------------------------------------
+
+ print(f"\nLoading model from {MODEL_DIR} ...")
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_DIR,
+     # Older transformers releases call this argument torch_dtype
+     dtype=torch.float16 if device == "cuda" else torch.float32,
+     low_cpu_mem_usage=True,
+ )
+ model.eval()
+ model.to(device)
+
+ total_params = sum(p.numel() for p in model.parameters())
+ print(f" Parameters : {total_params:,}")
+
+ # ---------------------------------------------------------------------------
+ # Generation helper
+ # ---------------------------------------------------------------------------
+
+ def generate(
+     prompt: str = PROMPT,
+     max_new_tokens: int = MAX_NEW_TOKENS,
+     temperature: float = TEMPERATURE,
+     top_p: float = TOP_P,
+     top_k: int = TOP_K,
+     repetition_penalty: float = REPETITION_PENALTY,
+     do_sample: bool = DO_SAMPLE,
+ ) -> str:
+     # Prepend BOS manually because special tokens are not added below
+     bos = tokenizer.bos_token or ""
+     full_prompt = bos + prompt
+
+     inputs = tokenizer(
+         full_prompt,
+         return_tensors="pt",
+         add_special_tokens=False,
+     ).to(device)
+     inputs.pop("token_type_ids", None)  # Qwen3 doesn't use this
+
+     gen_kwargs = dict(
+         max_new_tokens=max_new_tokens,
+         do_sample=do_sample,
+         repetition_penalty=repetition_penalty,
+         eos_token_id=tokenizer.eos_token_id,
+         pad_token_id=tokenizer.pad_token_id,
+     )
+     # Sampling parameters only apply when sampling is enabled
+     if do_sample:
+         gen_kwargs["temperature"] = temperature
+         gen_kwargs["top_p"] = top_p
+         gen_kwargs["top_k"] = top_k
+
+     with torch.inference_mode():
+         output_ids = model.generate(**inputs, **gen_kwargs)
+
+     # Strip the prompt tokens so we only return what was generated
+     prompt_len = inputs["input_ids"].shape[-1]
+     new_ids = output_ids[0][prompt_len:]
+     return tokenizer.decode(new_ids, skip_special_tokens=True)
+
+
+ # ---------------------------------------------------------------------------
+ # Run
+ # ---------------------------------------------------------------------------
+
+ if __name__ == "__main__":
+     print(f"\nPrompt : {PROMPT!r}")
+     print("-" * 60)
+
+     output = generate(PROMPT)
+
+     print("Generated:")
+     print(output)
+ ```
+
+ ## Related Models
+
+ 1. [PicoWord](https://huggingface.co/Harley-ml/PicoWord-5k)
+ 2. [MicroWord](https://huggingface.co/Harley-ml/MicroWord-23k)
+ 3. [TinyWord](https://huggingface.co/Harley-ml/TinyWord-134k)
+ 4. [MediumWord](https://huggingface.co/Harley-ml/MediumWord-559k)