Upload 12 files
- README.md +103 -0
- __init__.py +5 -0
- config.json +23 -0
- configuration_tinybuddy.py +39 -0
- generation_config.json +9 -0
- merges.txt +23 -0
- model.safetensors +3 -0
- modeling_tinybuddy.py +153 -0
- special_tokens_map.json +6 -0
- tokenizer.json +23 -0
- tokenizer_config.json +8 -0
- vocab.json +1 -0
README.md
ADDED
@@ -0,0 +1,103 @@
---
license: mit
language:
- en
library_name: transformers
tags:
- text-generation
- tiny-lm
- tinystories
- educational
- built-with-llama
- small-model
pipeline_tag: text-generation
datasets:
- roneneldan/TinyStories
---

# TinyBuddy-500K

> ⚠️ **Educational / experimental model.** TinyBuddy-500K is a from-scratch tiny Llama-style language model (~547K parameters) trained on a synthetic slice of TinyStories-style text.
> It is **not** a useful assistant; it is a working demonstration of training extremely small models from scratch. See the [Limitations](#limitations) section.

## Model description

TinyBuddy-500K is a very small decoder-only Transformer language model trained on synthetic children's stories in the style of [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories). The architecture follows the LLaMA design (RMSNorm, Grouped Query Attention, SiLU MLP, tied embeddings).

| Hyperparameter          | Value                            |
|-------------------------|----------------------------------|
| Parameters              | **547,296** (~547K)              |
| Layers                  | 2                                |
| Attention heads         | 4                                |
| Key-Value heads (GQA)   | 2                                |
| Hidden size             | 96                               |
| MLP intermediate size   | 384                              |
| Context length          | 512                              |
| Vocab size              | 2,048 (BPE trained from scratch) |
| Norm                    | RMSNorm                          |
| Activation              | SiLU                             |
| Position embeddings     | Learned absolute                 |
| Weight tying            | Yes (tied embeddings)            |
| Precision               | float32                          |

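As a rough cross-check of the table, the sketch below counts the weights implied by the listed hyperparameters. It assumes bias-free linear layers, a gated MLP with three projections, two RMSNorm weight vectors per layer plus a final one, and an output head tied to the token embedding. The helper name `count_parameters` and the `learned_positions` switch are hypothetical. Because the exact tensor layout of the checkpoint is not specified here, the totals need not equal the reported figure.

```python
def count_parameters(vocab=2048, hidden=96, layers=2, heads=4, kv_heads=2,
                     intermediate=384, context=512, learned_positions=False):
    """Count weights under the assumptions stated above (hypothetical helper)."""
    head_dim = hidden // heads
    embed = vocab * hidden                          # shared with the tied output head
    attn = (2 * hidden * heads * head_dim           # q_proj and o_proj
            + 2 * hidden * kv_heads * head_dim)     # k_proj and v_proj (GQA)
    mlp = 3 * hidden * intermediate                 # gate, up, and down projections
    norms = 2 * hidden                              # two RMSNorm weights per layer
    total = embed + layers * (attn + mlp + norms) + hidden  # + final RMSNorm
    if learned_positions:
        total += context * hidden                   # optional learned position table
    return total


print(count_parameters())                         # 473568
print(count_parameters(learned_positions=True))   # 522720
```

The embedding table alone accounts for 196,608 of these weights, which is why the vocabulary size dominates the budget at this scale.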
## Training details

- **Data**: Synthetic TinyStories-style corpus (~128K tokens)
- **Tokenizer**: Custom byte-level BPE with a 2,048-token vocabulary
- **Optimizer**: AdamW
- **Steps**: ~300
- **Hardware**: Single CPU core
- **Final loss**: ~0.17

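To put the final loss in perspective, it can be converted to a perplexity. The sketch below assumes the reported loss is a mean token-level cross-entropy in nats, which is what `torch.nn.functional.cross_entropy` returns by default. The helper name `perplexity` is hypothetical.

```python
import math


def perplexity(mean_cross_entropy_nats: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy_nats)


print(round(perplexity(0.17), 3))  # 1.185
```

A perplexity this close to 1 on a corpus of roughly 128K tokens describes fit to the training data only. It says nothing about held-out text.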
## Usage

This model uses **custom modeling code**, so you must pass `trust_remote_code=True`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo = "Eeppa/TinyBuddy-500K"

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
model.eval()

prompt = "Once upon a time, there was a little girl named Lily."
input_ids = tokenizer.encode(prompt, return_tensors="pt")

out = model.generate(input_ids, max_new_tokens=60, temperature=0.8, top_k=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

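The `temperature` and `top_k` arguments passed to `generate` above control how each next token is drawn. The following is a minimal pure-Python sketch of that sampling step for a single list of logits. The function name `sample_top_k` and the `rng` parameter are hypothetical, and the sketch is an illustration rather than the model's own implementation.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Draw one token index from temperature-scaled, top-k-filtered logits."""
    scaled = [x / temperature for x in logits]      # <1 sharpens, >1 flattens
    k = min(top_k, len(scaled))
    threshold = sorted(scaled, reverse=True)[k - 1]  # k-th largest scaled logit
    kept = [x if x >= threshold else float("-inf") for x in scaled]
    peak = max(kept)
    weights = [math.exp(x - peak) for x in kept]     # unnormalized softmax; -inf -> 0.0
    return rng.choices(range(len(kept)), weights=weights, k=1)[0]


# Usage: with top_k=2, only the two highest-scoring indices (0 and 1) can be drawn.
print(sample_top_k([5.0, 4.0, -1.0, -2.0], top_k=2, rng=random.Random(0)))
```

Lower temperatures concentrate probability on the highest-scoring tokens. With a model this small that tends to trade variety for more repetition.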
## Limitations

This model is extremely small and was trained for a very short time on limited data.

**What works**:
- Basic English patterns and short sentence structure
- Simple story-like generation

**What's broken**:
- Very limited coherence (usually breaks after 1–2 sentences)
- High repetition
- Poor long-range consistency
- No real reasoning or factual knowledge

This model exists purely for educational purposes to explore the lower limits of language model size.

## License

MIT

## Citation

```bibtex
@misc{tinybuddy500k,
  title = {TinyBuddy-500K: An educational ~500K parameter Llama-style model trained on TinyStories},
  year = {2026},
  note = {Educational demonstration of extremely small language models.}
}
```

**Built with Llama.**
__init__.py
ADDED
@@ -0,0 +1,5 @@
# TinyBuddy-500K package
from .modeling_tinybuddy import TinyBuddyForCausalLM
from .configuration_tinybuddy import TinyBuddyConfig

__all__ = ["TinyBuddyForCausalLM", "TinyBuddyConfig"]
config.json
ADDED
@@ -0,0 +1,23 @@
{
  "_name_or_path": "Eeppa/TinyBuddy-500K",
  "architectures": ["TinyBuddyForCausalLM"],
  "auto_map": {
    "AutoConfig": "configuration_tinybuddy.TinyBuddyConfig",
    "AutoModelForCausalLM": "modeling_tinybuddy.TinyBuddyForCausalLM"
  },
  "model_type": "tinybuddy",
  "vocab_size": 2048,
  "hidden_size": 96,
  "num_hidden_layers": 2,
  "num_attention_heads": 4,
  "num_key_value_heads": 2,
  "intermediate_size": 384,
  "max_position_embeddings": 512,
  "rms_norm_eps": 1e-6,
  "tie_word_embeddings": true,
  "bos_token_id": 2,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.40.0",
  "torch_dtype": "float32"
}
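The attention-related fields in config.json constrain one another: the hidden size must divide evenly across the attention heads, and the attention heads must divide evenly across the key-value heads. The sketch below derives the resulting shapes from a copy of the relevant fields. The helper name `derive_attention_shapes` and the returned key names are hypothetical.

```python
import json

# Subset of config.json, copied by hand for this sketch.
CONFIG_JSON = """
{"hidden_size": 96, "num_attention_heads": 4, "num_key_value_heads": 2}
"""


def derive_attention_shapes(cfg):
    """Validate head counts and return the projection sizes they imply."""
    hidden = cfg["hidden_size"]
    heads = cfg["num_attention_heads"]
    kv_heads = cfg["num_key_value_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    if heads % kv_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    head_dim = hidden // heads
    return {
        "head_dim": head_dim,
        "queries_per_kv_head": heads // kv_heads,
        "q_proj_out": heads * head_dim,
        "kv_proj_out": kv_heads * head_dim,
    }


shapes = derive_attention_shapes(json.loads(CONFIG_JSON))
print(shapes)
# {'head_dim': 24, 'queries_per_kv_head': 2, 'q_proj_out': 96, 'kv_proj_out': 48}
```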
configuration_tinybuddy.py
ADDED
@@ -0,0 +1,39 @@
"""
TinyBuddyConfig for TinyBuddy-500K
"""

from transformers import PretrainedConfig


class TinyBuddyConfig(PretrainedConfig):
    model_type = "tinybuddy"

    def __init__(
        self,
        vocab_size=2048,
        hidden_size=96,
        num_hidden_layers=2,
        num_attention_heads=4,
        num_key_value_heads=2,
        intermediate_size=384,
        max_position_embeddings=512,
        rms_norm_eps=1e-6,
        tie_word_embeddings=True,
        bos_token_id=2,
        eos_token_id=2,
        pad_token_id=0,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
        self.intermediate_size = intermediate_size
        self.max_position_embeddings = max_position_embeddings
        self.rms_norm_eps = rms_norm_eps
        self.tie_word_embeddings = tie_word_embeddings
        self.bos_token_id = bos_token_id
        self.eos_token_id = eos_token_id
        self.pad_token_id = pad_token_id
generation_config.json
ADDED
@@ -0,0 +1,9 @@
{
  "max_new_tokens": 80,
  "temperature": 0.8,
  "top_k": 50,
  "do_sample": true,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "repetition_penalty": 1.1
}
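The `repetition_penalty` of 1.1 is conventionally applied to the logits of tokens that have already been generated, for example by the `RepetitionPenaltyLogitsProcessor` in Transformers: positive scores are divided by the penalty and negative scores are multiplied by it. The sketch below shows that rule on plain Python lists. The function name `apply_repetition_penalty` is hypothetical. The `generate` method in modeling_tinybuddy.py names only `max_new_tokens`, `temperature`, and `top_k` as arguments, so whether this setting takes effect depends on the generation path used.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    out = list(logits)
    for token_id in set(generated_ids):
        score = out[token_id]
        # Dividing a positive score or multiplying a negative one both lower it.
        out[token_id] = score * penalty if score < 0 else score / penalty
    return out


# Usage: tokens 0 and 1 were generated before; token 2 is left untouched.
print(apply_repetition_penalty([2.2, -1.0, 0.5], generated_ids=[0, 1, 1]))
```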
merges.txt
ADDED
@@ -0,0 +1,23 @@
#version: 0.2
a e
t h
i n
o n
s t
r e
l e
d e
u s
m e
w a
f o
g o
y o
p a
b e
k i
v e
j u
x a
z e
q u
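Each line of merges.txt names a pair of symbols to fuse, and earlier lines take priority over later ones. The sketch below applies the first few merges from the file to a single lowercase word. It is a simplification that ignores byte-level pre-tokenization (such as the handling of spaces), and the function name `bpe_encode` is hypothetical.

```python
# First six merge rules from merges.txt, in file order (earlier = higher priority).
MERGES = [("a", "e"), ("t", "h"), ("i", "n"), ("o", "n"), ("s", "t"), ("r", "e")]


def bpe_encode(word, merges):
    """Greedily apply the highest-priority applicable merge until none remains."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no adjacent pair has a merge rule
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(bpe_encode("thin", MERGES))   # ['th', 'in']
print(bpe_encode("stone", MERGES))  # ['st', 'on', 'e']
```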
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:79cbf4a0790677946075a0cb32c455f830699535ff46adefd89c811b66b2593b
size 2977648
modeling_tinybuddy.py
ADDED
@@ -0,0 +1,153 @@
"""
TinyBuddy-500K: Educational ~500K parameter Llama-style model
MIT License
"""

from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import PreTrainedModel, PretrainedConfig
from transformers.modeling_outputs import CausalLMOutputWithPast


@dataclass
class TinyBuddyConfig(PretrainedConfig):
    model_type = "tinybuddy"

    vocab_size: int = 2048
    hidden_size: int = 96
    num_hidden_layers: int = 2
    num_attention_heads: int = 4
    num_key_value_heads: int = 2
    intermediate_size: int = 384
    max_position_embeddings: int = 512
    rms_norm_eps: float = 1e-6
    tie_word_embeddings: bool = True
    bos_token_id: int = 2
    eos_token_id: int = 2

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        for k, v in kwargs.items():
            setattr(self, k, v)


class RMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root mean square over the hidden dimension.
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x


class GroupedQueryAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.num_heads = config.num_attention_heads
        self.num_kv_heads = config.num_key_value_heads
        self.head_dim = config.hidden_size // self.num_heads

        self.q_proj = nn.Linear(config.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(config.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(config.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, config.hidden_size, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.num_kv_heads, self.head_dim).transpose(1, 2)

        # Repeat each key/value head so that it is shared by a group of query heads.
        k = k.repeat_interleave(self.num_heads // self.num_kv_heads, dim=1)
        v = v.repeat_interleave(self.num_heads // self.num_kv_heads, dim=1)

        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        # Causal mask: each position may attend only to itself and earlier positions.
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(causal_mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, v)
        out = out.transpose(1, 2).contiguous().view(B, T, self.num_heads * self.head_dim)
        return self.o_proj(out)


class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)

    def forward(self, x):
        # SiLU-gated feed-forward block.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class DecoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.self_attn = GroupedQueryAttention(config)
        self.mlp = MLP(config)
        self.input_layernorm = RMSNorm(config.hidden_size, config.rms_norm_eps)
        self.post_attention_layernorm = RMSNorm(config.hidden_size, config.rms_norm_eps)

    def forward(self, x):
        # Pre-norm residual block: attention first, then the MLP.
        residual = x
        x = self.input_layernorm(x)
        x = self.self_attn(x)
        x = residual + x

        residual = x
        x = self.post_attention_layernorm(x)
        x = self.mlp(x)
        x = residual + x
        return x


class TinyBuddyForCausalLM(PreTrainedModel):
    config_class = TinyBuddyConfig
    base_model_prefix = "tinybuddy"

    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList([DecoderLayer(config) for _ in range(config.num_hidden_layers)])
        self.norm = RMSNorm(config.hidden_size, config.rms_norm_eps)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        if config.tie_word_embeddings:
            self.lm_head.weight = self.embed_tokens.weight

        self.post_init()

    def forward(self, input_ids, labels=None, **kwargs):
        x = self.embed_tokens(input_ids)
        for layer in self.layers:
            x = layer(x)
        x = self.norm(x)
        logits = self.lm_head(x)

        loss = None
        if labels is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))

        return CausalLMOutputWithPast(loss=loss, logits=logits)

    @torch.no_grad()
    def generate(self, input_ids, max_new_tokens=50, temperature=0.8, top_k=50, **kwargs):
        for _ in range(max_new_tokens):
            # Temperature-scaled logits for the last position only.
            logits = self(input_ids).logits[:, -1, :] / temperature
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float("Inf")
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=1)
        return input_ids


TinyBuddyForCausalLM.register_for_auto_class("AutoModelForCausalLM")
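Two pieces of modeling_tinybuddy.py are easy to check by hand without PyTorch: the RMSNorm scaling and the way `repeat_interleave` pairs query heads with key-value heads. The sketch below restates both in plain Python for a single vector and for the head counts in the config. The function names `rms_norm` and `kv_head_for_query_head` are hypothetical.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale `x` to unit root-mean-square, then apply a per-element weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * scale for w, v in zip(weight, x)]


def kv_head_for_query_head(num_heads=4, num_kv_heads=2):
    """Index of the key-value head that each query head attends with."""
    group_size = num_heads // num_kv_heads
    # repeat_interleave turns KV heads [0, 1] into [0, 0, 1, 1] for group_size 2.
    return [q // group_size for q in range(num_heads)]


print(kv_head_for_query_head())          # [0, 0, 1, 1]
print(rms_norm([3.0, 4.0], [1.0, 1.0]))  # output has a mean square of about 1
```

Sharing each key-value head between two query heads halves the size of `k_proj` and `v_proj` compared with standard multi-head attention.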
special_tokens_map.json
ADDED
@@ -0,0 +1,6 @@
{
  "bos_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "<pad>",
  "unk_token": "<unk>"
}
tokenizer.json
ADDED
@@ -0,0 +1,23 @@
{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {"id": 50256, "content": "<|endoftext|>", "special": true, "single_word": false, "lstrip": false, "rstrip": false, "normalized": false}
  ],
  "normalizer": null,
  "pre_tokenizer": {"type": "ByteLevel", "add_prefix_space": false, "use_regex": true},
  "post_processor": null,
  "decoder": {"type": "ByteLevel"},
  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": "",
    "end_of_word_suffix": "",
    "fuse_unk": false,
    "byte_fallback": false,
    "vocab": {},
    "merges": []
  }
}
tokenizer_config.json
ADDED
@@ -0,0 +1,8 @@
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "model_max_length": 512,
  "bos_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "<pad>",
  "unk_token": "<unk>"
}
vocab.json
ADDED
@@ -0,0 +1 @@
{"<pad>": 0, "<unk>": 1, "<s>": 2, "</s>": 3, "a": 4, "e": 5, "i": 6, "o": 7, "t": 8, "n": 9, "s": 10, "r": 11, "h": 12, "l": 13, "d": 14, "u": 15, "c": 16, "m": 17, "w": 18, "f": 19, "g": 20, "y": 21, "p": 22, "b": 23, "k": 24, "v": 25, "j": 26, "x": 27, "z": 28, "q": 29}