File size: 7,210 Bytes

---
license: mit
datasets:
- Harley-ml/es-en-words
language:
- en
tags:
- small
- small-language-model
- largeword
- word-generation
- harley-ml
- word
- words
- wordgen
- qwen3
---

# LargeWord

LargeWord is the largest model in the [WordGen](https://huggingface.co/collections/Harley-ml/wordgen) family and has about 1.59M parameters.

LargeWord generates plausible or real words learned from its pretraining dataset.

## Architecture

| Parameter               | Value |
|-------------------------|-------|
| hidden_size             | 160   |
| num_hidden_layers       | 4     |
| num_attention_heads     | 2     |
| num_key_value_heads     | 2     |
| intermediate_size       | 512   |
| max_position_embeddings | 77    |
| rope_theta              | 10000.0 |
| tie_word_embeddings     | True  |
| vocab_size              | 1204  |

## Training

LargeWord was trained on 753,232 words and 4,153,110 tokens. Its goal is to generate plausible-looking or real words.

### Hardware

LargeWord was trained on an NVIDIA RTX 2060 6GB for 2 epochs with a batch size of 8.

### Training Results

| Step | Epoch | Train Loss | Train PPL | Eval Loss | Eval PPL |
|------|-------|------------|-----------|-----------|----------|
| 500  | 0.30  | 4.3276     | 75.74     | 2.4190    | 11.23    |
| 1000 | 0.61  | 1.7151     | 5.56      | 1.4076    | 4.09     |
| 1500 | 0.91  | 1.3247     | 3.76      | 1.2682    | 3.55     |
| 2000 | 1.21  | 1.2120     | 3.36      | 1.2026    | 3.33     |
| 2500 | 1.51  | 1.1619     | 3.20      | 1.1667    | 3.21     |
| 3000 | 1.82  | 1.1314     | 3.10      | 1.1378    | 3.12     |

![Training and Evaluation Curves](images/training_graph.png)

## Generations

Prompt: `w`

Output:
```
weldosfish's
```

Prompt: `app`

Output:
```
appardness
```

Prompt: `z`

Output:
```
zeething's
```

## Use Cases

1. Educational research
2. Morphological/phonetic research
3. Deployment on constrained devices
4. Or, more simply, for fun.

# Inference

```python
# =============================================================================
# Inference
# =============================================================================

MODEL_DIR      = "Harley-ml/LargeWord-1.5M"   # path
TOKENIZER_PATH = MODEL_DIR

# --- Generation settings ---
PROMPT             = "a"   # prompt
MAX_NEW_TOKENS     = 16
TEMPERATURE        = 1.2
TOP_P              = 0.95
TOP_K              = 200
REPETITION_PENALTY = 1.1
DO_SAMPLE          = True

# =============================================================================

import torch
from pathlib import Path
from transformers import (
    AutoModelForCausalLM,
    PreTrainedTokenizerFast,
    AddedToken,
)

# ---------------------------------------------------------------------------
# Device
# ---------------------------------------------------------------------------

device = (
    "cuda" if torch.cuda.is_available() else
    "mps"  if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")

# ---------------------------------------------------------------------------
# Tokenizer  (mirrors training setup)
# ---------------------------------------------------------------------------

def load_tokenizer(path: str):
    p = Path(path).resolve()
    if not p.exists():
        raise FileNotFoundError(f"Tokenizer not found: {p}")
    tok = PreTrainedTokenizerFast(tokenizer_file=str(p))
    specials = {}
    if tok.bos_token is None: specials["bos_token"] = AddedToken("<|bos|>", special=True)
    if tok.eos_token is None: specials["eos_token"] = AddedToken("<|eos|>", special=True)
    if tok.unk_token is None: specials["unk_token"] = AddedToken("<|unk|>", special=True)
    if tok.pad_token is None:
        if tok.eos_token is not None:
            tok.pad_token = tok.eos_token
        else:
            specials["pad_token"] = AddedToken("<|pad|>", special=True)
    if specials:
        tok.add_special_tokens(specials)
    tok.padding_side = "left"  # left-pad for batched generation
    return tok

print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f"  Vocab size : {tokenizer.vocab_size}")
print(f"  BOS        : {tokenizer.bos_token!r}")
print(f"  EOS        : {tokenizer.eos_token!r}")
print(f"  PAD        : {tokenizer.pad_token!r}  (id={tokenizer.pad_token_id})")

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------

print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)
model.eval()
model.to(device)

total_params = sum(p.numel() for p in model.parameters())
print(f"  Parameters : {total_params:,}")

# ---------------------------------------------------------------------------
# Generation helper
# ---------------------------------------------------------------------------

def generate(
    prompt: str             = PROMPT,
    max_new_tokens: int     = MAX_NEW_TOKENS,
    temperature: float      = TEMPERATURE,
    top_p: float            = TOP_P,
    top_k: int              = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool         = DO_SAMPLE,
) -> str:
    
    bos         = tokenizer.bos_token or ""
    full_prompt = bos + prompt

    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    inputs.pop("token_type_ids", None)  # Qwen3 doesn't use this

    gen_kwargs = dict(
        max_new_tokens     = max_new_tokens,
        do_sample          = do_sample,
        repetition_penalty = repetition_penalty,
        eos_token_id       = tokenizer.eos_token_id,
        pad_token_id       = tokenizer.pad_token_id,
    )
    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"]       = top_p
        gen_kwargs["top_k"]       = top_k

    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)

    # Strip the prompt tokens so we only return what was generated
    prompt_len = inputs["input_ids"].shape[-1]
    new_ids    = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)


# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)

    output = generate(PROMPT)

    print("Generated:")
    print(output)
```
### Related Models

1. [PicoWord](https://huggingface.co/Harley-ml/PicoWord-5k)
2. [MicroWord](https://huggingface.co/Harley-ml/MicroWord-23k)
3. [TinyWord](https://huggingface.co/Harley-ml/TinyWord-134k)
4. [TinyWord2](https://huggingface.co/Harley-ml/TinyWord2-128k)
5. [MediumWord](https://huggingface.co/Harley-ml/MediumWord-559k)

## Citation

```bibtex
@misc{largeword-1.5m,
  title     = {LargeWord-1.5M: A Test of Morphological Compression in TLMs},
  author    = {Harley-ml},
  year      = {2026},
  url       = {https://huggingface.co/Harley-ml/LargeWord-1.5M}
}
```