# korean-wiki-lm-195m
A 195M-parameter decoder-only Transformer language model trained entirely from scratch on Korean Wikipedia (kowiki). Both the byte-level BPE tokenizer and the model architecture are implemented from scratch in PyTorch.
## Model Details

| | |
|---|---|
| Parameters | 194,482,944 (unique, with weight tying) |
| Architecture | LLaMA-style decoder-only Transformer |
| Vocab size | 32,000 (byte-level BPE) |
| Context length | 1,024 tokens |
| Hidden dim | 768 |
| Layers | 24 |
| Attention heads | 12 |
| FFN dim | 2,048 (SwiGLU) |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | Rotary Position Embeddings (RoPE) |
| Weight tying | Embedding ↔ LM head |
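The stated parameter count can be checked directly from the table. A minimal sketch, assuming the standard LLaMA-style block layout; the field names below are hypothetical and may differ from the repository's actual Config:

```python
from dataclasses import dataclass

@dataclass
class Config:  # hypothetical field names; the repo's Config may differ
    vocab_size: int = 32_000
    max_seq_len: int = 1_024
    dim: int = 768
    n_layers: int = 24
    n_heads: int = 12
    ffn_dim: int = 2_048

def count_params(cfg: Config) -> int:
    embed = cfg.vocab_size * cfg.dim          # tied with the LM head, counted once
    attn = 4 * cfg.dim * cfg.dim              # Wq, Wk, Wv, Wo
    ffn = 3 * cfg.dim * cfg.ffn_dim           # SwiGLU: gate, up, down projections
    norms = 2 * cfg.dim                       # two RMSNorms per block
    block = attn + ffn + norms
    return embed + cfg.n_layers * block + cfg.dim  # + final RMSNorm

print(count_params(Config()))  # 194482944, matching the table
```

The breakdown works out to ~24.6M tied embedding parameters plus ~7.08M per block across 24 blocks, plus a final RMSNorm.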
## Training

| | |
|---|---|
| Data | Korean Wikipedia dump (March 2026), 502,127 articles |
| Clean text | ~1.12 GB across 12 shards |
| Tokens | 230,722,476 |
| Epochs | 5 (~1.15B tokens processed) |
| Optimizer | AdamW (lr=3e-4, cosine decay to 3e-5, warmup 2,000 steps) |
| Batch size | 128 sequences (16 micro-batch × 8 accumulation steps) |
| Precision | bf16 mixed precision |
| Hardware | Single NVIDIA RTX 4080 (16 GB VRAM) |
| Training time | ~12 hours |
| Final loss | ~2.97 |
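The learning-rate schedule above (linear warmup for 2,000 steps, then cosine decay from 3e-4 to 3e-5) can be sketched in a few lines. The total step count of 8,800 is read off the loss curve below and is an assumption, not a repository constant:

```python
import math

MAX_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS = 2_000
TOTAL_STEPS = 8_800  # taken from the loss log; an assumption, not a repo constant

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS                   # linear warmup
    t = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)  # progress in [0, 1]
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * t))

# The effective batch of 128 sequences comes from gradient accumulation:
# 16 sequences per micro-batch x 8 accumulation steps.
```

The schedule peaks at 3e-4 exactly at step 2,000 and reaches the 3e-5 floor at the final step.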
## Loss Curve

| Step | Epoch | Loss |
|------|-------|------|
| 0 | 0 | 10.53 |
| 500 | 0 | 6.71 |
| 1000 | 0 | 5.48 |
| 2000 | 1 | 4.08 |
| 3000 | 1 | 3.67 |
| 4000 | 2 | 3.41 |
| 5000 | 2 | 3.31 |
| 6000 | 3 | 3.12 |
| 7000 | 3 | 3.10 |
| 8000 | 4 | 2.99 |
| 8800 | 4 | 2.97 |
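Two sanity checks on these numbers: the step-0 loss is close to ln(32,000), i.e. a uniform distribution over the 32,000-token vocabulary, and the final cross-entropy of ~2.97 corresponds to a per-token perplexity of roughly 19.5:

```python
import math

vocab_size = 32_000
print(math.log(vocab_size))  # ≈ 10.37, near the step-0 loss of 10.53
print(math.exp(2.97))        # ≈ 19.5, the final per-token perplexity
```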
## Tokenizer

Byte-level BPE tokenizer trained from scratch on 100 MB of Korean Wikipedia text. It starts from 256 byte-level tokens plus 3 special tokens and learns 31,741 merges, reaching a vocabulary of 32,000 tokens.

Pre-tokenization regex (Korean-focused):

` ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+`
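The core of BPE training is a simple loop: count adjacent token pairs, fuse the most frequent pair into a new token id, and repeat until the vocabulary is full. A toy sketch of one merge step, not the repository's implementation:

```python
from collections import Counter

def most_frequent_pair(ids):
    """Most common adjacent pair of token ids."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list(b"aaabdaaabac")      # raw bytes form the initial 256-token alphabet
pair = most_frequent_pair(ids)  # (97, 97): the byte pair "aa" occurs most often
ids = merge(ids, pair, 256)     # the first learned merge becomes token 256
# ids is now [256, 97, 98, 100, 256, 97, 98, 97, 99]
```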
## Usage

The model is standalone PyTorch; no Hugging Face Transformers dependency is needed:
```python
import torch

from config import Config
from model import KoGPT
from tokenizer import BPETokenizer
from generate import generate

cfg = Config()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the trained BPE tokenizer from the current directory
tokenizer = BPETokenizer(cfg)
tokenizer.load(".")

# Restore the trained weights from the checkpoint
model = KoGPT(cfg).to(device)
ckpt = torch.load("model.pt", map_location=device, weights_only=False)
model.load_state_dict(ckpt["model"])
model.eval()

output = generate(
    model, tokenizer,
    prompt="한국의 역사는",  # "Korean history is"
    max_tokens=200,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    device=device,
)
print(output)
```
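The temperature, top_k, and top_p arguments compose in the usual way: logits are divided by the temperature, only the k most probable tokens are kept, and of those only the smallest prefix whose cumulative probability reaches p survives. A framework-free sketch of that filtering step, which may differ from the repo's generate internals:

```python
import math

def filter_logits(logits, temperature=0.8, top_k=50, top_p=0.9):
    """Return (token_index, probability) pairs eligible for sampling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # top-k: keep only the k most probable tokens
    ranked = sorted(enumerate(probs), key=lambda t: -t[1])[:top_k]
    # top-p: keep the smallest prefix whose cumulative mass reaches p
    kept, cum = [], 0.0
    for idx, p in ranked:
        kept.append((idx, p))
        cum += p
        if cum >= top_p:
            break
    z = sum(p for _, p in kept)
    return [(idx, p / z) for idx, p in kept]  # renormalized distribution

# One dominant logit passes both filters alone
print(filter_logits([5.0, 1.0, 0.5, 0.2], temperature=1.0))  # [(0, 1.0)]
```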
## Example Outputs

All samples below were generated with temperature=0.8, top_k=50, top_p=0.9, seed=42.
Prompt: "서울특별시는" ("Seoul Special City is")

서울특별시는 '서울특별시에 속한 대한민국 제2도시인 서울시를 중심으로 한 광역자치단체'이다. 역사 1950년 8월 1일에 서울시가 대한민국 제2도시로 승격하면서, 서울특별시(서울)와 경기도(서울)의 경계선을 따라서 설치되었다. 그 후, 1989년 1월 1일에 서울특별시로 편제된 경기도에서 분리되었다.

(Translation: Seoul Special City is "a metropolitan local government centered on the city of Seoul, the second city of the Republic of Korea, belonging to Seoul Special City." History: on August 1, 1950, as Seoul was elevated to the second city of the Republic of Korea, it was established along the boundary between Seoul Special City (Seoul) and Gyeonggi Province (Seoul). Later, on January 1, 1989, it was separated from Gyeonggi Province, which had been organized into Seoul Special City.)
Prompt: "인공지능은" ("Artificial intelligence is")

인공지능은 지능이 아닌, 인공 지능(AI)을 사용하여 인간의 능력을 향상시키는 기술이다. 인공 지능은 인간의 능력에 대한 지식과 지능을 모두 포함한다. AI는 인공지능의 하위 분야로 간주될 수 있다. 인공지능의 적용은 인간 지능의 개념과 유사하다.

(Translation: Artificial intelligence is a technology that enhances human abilities using artificial intelligence (AI), rather than intelligence. Artificial intelligence includes both knowledge and intelligence about human abilities. AI can be regarded as a subfield of artificial intelligence. The application of artificial intelligence is similar to the concept of human intelligence.)
Prompt: "수학에서 피타고라스의 정리는" ("In mathematics, the Pythagorean theorem is")

수학에서 피타고라스의 정리는 다음과 같다. :\operatorname{Span}_{\mathbb R} = \frac1{2\pi i}\int_0^{\infty} \frac{\sin(\tau)}{\sin(x)} ; \mathrm{Re}\bigl(\tau,\mathrm{Can}\bigr)

(Translation: "In mathematics, the Pythagorean theorem is as follows," followed by LaTeX-like output that is not a valid statement of the theorem.)
## Limitations

- Small (195M) model trained on a single domain (Korean Wikipedia); it will hallucinate facts confidently.
- Not instruction-tuned: it continues text in Wikipedia style rather than answering as a chatbot.
- Context window is limited to 1,024 tokens.
- Trained on Korean only; English capability is minimal.
## License

MIT