---
license: mit
datasets:
- Harley-ml/es-en-words
language:
- en
tags:
- small
- small-language-model
- largeword
- word-generation
- harley-ml
- word
- words
- wordgen
- qwen3
---

# LargeWord

LargeWord is the largest model in the [WordGen](https://huggingface.co/collections/Harley-ml/wordgen) family and has about 1.59M parameters. LargeWord generates plausible or real words learned from its pretraining dataset.

## Architecture

| Parameter               | Value   |
|-------------------------|---------|
| hidden_size             | 160     |
| num_hidden_layers       | 4       |
| num_attention_heads     | 2       |
| num_key_value_heads     | 2       |
| intermediate_size       | 512     |
| max_position_embeddings | 77      |
| rope_theta              | 10000.0 |
| tie_word_embeddings     | True    |
| vocab_size              | 1204    |

## Training

LargeWord was trained on 753,232 words and 4,153,110 tokens. Its goal is to generate plausible-looking or real words.

### Hardware

LargeWord was trained on an NVIDIA RTX 2060 6GB for 2 epochs with a batch size of 8.

### Training Results

| Step | Epoch | Train Loss | Train PPL | Eval Loss | Eval PPL |
|------|-------|------------|-----------|-----------|----------|
| 500  | 0.30  | 4.3276     | 75.74     | 2.4190    | 11.23    |
| 1000 | 0.61  | 1.7151     | 5.56      | 1.4076    | 4.09     |
| 1500 | 0.91  | 1.3247     | 3.76      | 1.2682    | 3.55     |
| 2000 | 1.21  | 1.2120     | 3.36      | 1.2026    | 3.33     |
| 2500 | 1.51  | 1.1619     | 3.20      | 1.1667    | 3.21     |
| 3000 | 1.82  | 1.1314     | 3.10      | 1.1378    | 3.12     |

![Training and Evaluation Curves](images/training_graph.png)

## Generations

Prompt: `w`

Output:
```
weldosfish's
```

Prompt: `app`

Output:
```
appardness
```

Prompt: `z`

Output:
```
zeething's
```

## Use Cases

1. Educational research
2. Morphological/phonetic research
3. Deployment on constrained devices
4. Or, more simply, for fun.

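
The figures quoted in the Architecture and Training Results sections can be cross-checked with a few lines of arithmetic. The sketch below is not taken from the training code: it assumes a Qwen3-style decoder (bias-free projections, RMSNorm on queries and keys, a gated MLP) and, since the table does not list it, a head dimension of `hidden_size / num_attention_heads`. The helper names `count_parameters` and `perplexity` are made up for this illustration.

```python
import math

# Values copied from the Architecture table.
HIDDEN = 160
LAYERS = 4
HEADS = 2
KV_HEADS = 2
INTERMEDIATE = 512
VOCAB = 1204

# Assumption (not stated in the table): head_dim = hidden_size / num_attention_heads.
HEAD_DIM = HIDDEN // HEADS


def count_parameters() -> int:
    """Estimate the parameter count of a Qwen3-style decoder without biases."""
    embed = VOCAB * HIDDEN  # shared with the output head (tie_word_embeddings)
    q_proj = HIDDEN * HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = HEADS * HEAD_DIM * HIDDEN
    qk_norm = 2 * HEAD_DIM  # RMSNorm weights for queries and keys
    mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up and down projections
    block_norms = 2 * HIDDEN  # RMSNorm before attention and before the MLP
    per_layer = q_proj + k_proj + v_proj + o_proj + qk_norm + mlp + block_norms
    final_norm = HIDDEN
    return embed + LAYERS * per_layer + final_norm


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(loss)


if __name__ == "__main__":
    print(f"Estimated parameters : {count_parameters():,}")
    print(f"Eval PPL at step 3000: {perplexity(1.1378):.2f}")
```

Under these assumptions the estimate comes to 1,587,360 parameters, in line with the "about 1.59M" quoted above, and `exp(1.1378)` rounds to the 3.12 eval perplexity in the last table row.
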
## Inference

```python
# =============================================================================
# Inference
# =============================================================================

MODEL_DIR = "Harley-ml/LargeWord-1.5M"  # Hub repo id or local model directory
TOKENIZER_PATH = MODEL_DIR              # may also point at a tokenizer.json file

# --- Generation settings ---
PROMPT = "a"  # prompt
MAX_NEW_TOKENS = 16
TEMPERATURE = 1.2
TOP_P = 0.95
TOP_K = 200
REPETITION_PENALTY = 1.1
DO_SAMPLE = True

# =============================================================================

import torch
from pathlib import Path
from transformers import (
    AutoModelForCausalLM,
    PreTrainedTokenizerFast,
    AddedToken,
)

# ---------------------------------------------------------------------------
# Device
# ---------------------------------------------------------------------------
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Device : {device}")


# ---------------------------------------------------------------------------
# Tokenizer (mirrors training setup)
# ---------------------------------------------------------------------------
def load_tokenizer(path: str) -> PreTrainedTokenizerFast:
    p = Path(path)
    if p.is_file():
        # A raw tokenizer.json file on disk
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p.resolve()))
    else:
        # A local model directory or a Hub repo id
        # (assumes the repo ships a tokenizer.json)
        tok = PreTrainedTokenizerFast.from_pretrained(path)

    # Register any special tokens the tokenizer file does not define
    specials = {}
    if tok.bos_token is None:
        specials["bos_token"] = AddedToken("<|bos|>", special=True)
    if tok.eos_token is None:
        specials["eos_token"] = AddedToken("<|eos|>", special=True)
    if tok.unk_token is None:
        specials["unk_token"] = AddedToken("<|unk|>", special=True)
    if tok.pad_token is None:
        if tok.eos_token is not None:
            tok.pad_token = tok.eos_token
        else:
            specials["pad_token"] = AddedToken("<|pad|>", special=True)
    if specials:
        tok.add_special_tokens(specials)

    tok.padding_side = "left"  # left-pad for batched generation
    return tok


print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f"  Vocab size : {tokenizer.vocab_size}")
print(f"  BOS        : {tokenizer.bos_token!r}")
print(f"  EOS        : {tokenizer.eos_token!r}")
print(f"  PAD        : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------
print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    # On older transformers releases this argument is named `torch_dtype`
    dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)
model.eval()
model.to(device)

total_params = sum(p.numel() for p in model.parameters())
print(f"  Parameters : {total_params:,}")


# ---------------------------------------------------------------------------
# Generation helper
# ---------------------------------------------------------------------------
def generate(
    prompt: str = PROMPT,
    max_new_tokens: int = MAX_NEW_TOKENS,
    temperature: float = TEMPERATURE,
    top_p: float = TOP_P,
    top_k: int = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool = DO_SAMPLE,
) -> str:
    # Prepend BOS manually because add_special_tokens is disabled below
    bos = tokenizer.bos_token or ""
    full_prompt = bos + prompt

    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    inputs.pop("token_type_ids", None)  # Qwen3 doesn't use this

    gen_kwargs = dict(
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        repetition_penalty=repetition_penalty,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    # Sampling parameters only apply when sampling is enabled
    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"] = top_p
        gen_kwargs["top_k"] = top_k

    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)

    # Strip the prompt tokens so we only return what was generated
    prompt_len = inputs["input_ids"].shape[-1]
    new_ids = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)


# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)
    output = generate(PROMPT)
    print("Generated:")
    print(output)
```

## Related Models

1. [PicoWord](https://huggingface.co/Harley-ml/PicoWord-5k)
2. [MicroWord](https://huggingface.co/Harley-ml/MicroWord-23k)
3. [TinyWord](https://huggingface.co/Harley-ml/TinyWord-134k)
4. [TinyWord2](https://huggingface.co/Harley-ml/TinyWord2-128k)
5. [MediumWord](https://huggingface.co/Harley-ml/MediumWord-559k)

## Citation

```bibtex
@misc{largeword-1.5m,
  title  = {LargeWord-1.5M: A Test of Morphological Compression in TLMs},
  author = {Harley-ml},
  year   = {2026},
  url    = {https://huggingface.co/Harley-ml/LargeWord-1.5M}
}
```