korean-wiki-lm-195m

A 195M-parameter decoder-only Transformer language model trained entirely from scratch on Korean Wikipedia (kowiki). Both the byte-level BPE tokenizer and the model are implemented from scratch in plain Python/PyTorch, with no external tokenization or modeling libraries.

Model Details

Parameters: 194,482,944 (unique, with weight tying)
Architecture: LLaMA-style decoder-only Transformer
Vocab size: 32,000 (byte-level BPE)
Context length: 1,024 tokens
Hidden dim: 768
Layers: 24
Attention heads: 12
FFN dim: 2,048 (SwiGLU)
Normalization: RMSNorm (pre-norm)
Position encoding: Rotary Position Embeddings (RoPE)
Weight tying: embedding ↔ LM head
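
The headline parameter count can be reproduced from the table above. A minimal sanity check, assuming no bias terms, a parameter-free RoPE, and the tied LM head counted once:

```python
# Reproduce the 194,482,944 parameter count from the config above.
# Assumptions: no biases, RoPE adds no parameters, tied LM head not double-counted.
vocab, d, layers, ffn = 32_000, 768, 24, 2_048

embed = vocab * d                       # token embedding (shared with LM head)
attn = 4 * d * d                        # Q, K, V, O projections
swiglu = 3 * d * ffn                    # gate, up, and down projections
norms = 2 * d                           # two RMSNorm scales per block
per_layer = attn + swiglu + norms

total = embed + layers * per_layer + d  # + final RMSNorm
print(total)  # 194482944
```

The exact match suggests the count excludes the LM head (tied) and that attention and FFN layers carry no biases.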

Training

Data: Korean Wikipedia dump (March 2026), 502,127 articles
Clean text: ~1.12 GB across 12 shards
Tokens: 230,722,476
Epochs: 5 (~1.15B tokens processed)
Optimizer: AdamW (lr=3e-4, cosine decay to 3e-5, warmup 2,000 steps)
Batch size: 128 sequences (16 micro-batch × 8 accumulation steps)
Precision: bf16 mixed precision
Hardware: single NVIDIA RTX 4080 (16 GB VRAM)
Training time: ~12 hours
Final loss: ~2.97
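
These numbers are internally consistent: 128 sequences × 1,024 tokens is 131,072 tokens per optimizer step, so five epochs over ~230.7M tokens works out to roughly 8,800 steps, matching the final logged step in the loss curve. A quick check:

```python
# Back-of-envelope step count from the training config above.
tokens_per_step = 128 * 1024            # batch size x context length
tokens = 230_722_476                    # tokens per epoch
steps_per_epoch = tokens // tokens_per_step
total_steps = 5 * steps_per_epoch
print(steps_per_epoch, total_steps)     # 1760 8800
```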

Loss Curve

Step     0 (epoch 0): 10.53
Step   500 (epoch 0):  6.71
Step  1000 (epoch 0):  5.48
Step  2000 (epoch 1):  4.08
Step  3000 (epoch 1):  3.67
Step  4000 (epoch 2):  3.41
Step  5000 (epoch 2):  3.31
Step  6000 (epoch 3):  3.12
Step  7000 (epoch 3):  3.10
Step  8000 (epoch 4):  2.99
Step  8800 (epoch 4):  2.97
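
Two sanity checks on the curve: the step-0 loss should sit near ln(vocab) ≈ 10.37 for a near-uniform prediction over 32,000 tokens (the observed 10.53 is close), and the final loss of ~2.97 corresponds to a perplexity of about 19.5:

```python
import math

vocab = 32_000
print(round(math.log(vocab), 2))   # 10.37 -- near the 10.53 observed at step 0
print(round(math.exp(2.97), 1))    # 19.5  -- perplexity at the final loss
```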

Tokenizer

Byte-level BPE tokenizer trained from scratch on 100 MB of Korean Wikipedia text. Starts from 256 byte-level tokens + 3 special tokens, learns 31,741 merges to reach a vocabulary of 32,000 tokens.
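
The merge-learning loop is the standard greedy BPE procedure: repeatedly count adjacent token pairs and merge the most frequent one into a new token. A toy sketch (ignoring the special tokens and the pre-tokenization split that the real trainer applies first):

```python
from collections import Counter

def learn_merges(text, num_merges):
    """Toy byte-level BPE training loop (a sketch, not the repo's trainer)."""
    seq = list(text.encode("utf-8"))  # initial token ids are raw bytes 0..255
    merges = []
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, _count = pairs.most_common(1)[0]  # most frequent adjacent pair
        merges.append((pair, next_id))
        out, i = [], 0                          # replace every occurrence
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    return merges, seq
```

Running `learn_merges("aaab", 1)` merges the most frequent pair `(97, 97)` into token 256, leaving the sequence `[256, 97, 98]`.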

Pre-tokenization regex (Korean-focused):

 ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+

Usage

Standalone PyTorch model; no Hugging Face Transformers dependency is needed:

import torch
from config import Config
from model import KoGPT
from tokenizer import BPETokenizer
from generate import generate

cfg = Config()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the trained BPE tokenizer files from the repo root
tokenizer = BPETokenizer(cfg)
tokenizer.load(".")

# Restore model weights from the training checkpoint
model = KoGPT(cfg).to(device)
ckpt = torch.load("model.pt", map_location=device, weights_only=False)
model.load_state_dict(ckpt["model"])
model.eval()

output = generate(
    model, tokenizer,
    prompt="ν•œκ΅­μ˜ μ—­μ‚¬λŠ”",
    max_tokens=200,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    device=device,
)
print(output)
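
The `temperature`/`top_k`/`top_p` knobs compose in the usual way: scale logits by 1/temperature, keep the k most probable tokens, then keep the smallest prefix whose cumulative probability reaches p, renormalize, and sample. A dependency-free sketch of that filter (not the repo's actual `generate` implementation, which operates on torch tensors):

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9, rng=random):
    """Temperature + top-k + top-p (nucleus) sampling over raw logits."""
    scaled = [x / temperature for x in logits]      # temperature scaling
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]        # numerically stable softmax
    z = sum(exps)
    probs = sorted(((i, e / z) for i, e in enumerate(exps)),
                   key=lambda p: p[1], reverse=True)
    probs = probs[:top_k]                           # top-k filter
    kept, cum = [], 0.0
    for i, p in probs:                              # top-p: smallest prefix >= p
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    z = sum(p for _, p in kept)                     # renormalize and sample
    r = rng.random() * z
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]

print(sample_next([0.0, 0.0, 10.0, 0.0]))  # 2 -- the dominant logit always wins
```

With a logit gap this large, the nucleus collapses to a single token, so the result is deterministic regardless of the RNG.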

Example Outputs

Outputs generated with temperature=0.8, top_k=50, top_p=0.9, seed=42. Prompt: "μ„œμšΈνŠΉλ³„μ‹œλŠ”" ("Seoul is"):

μ„œμšΈνŠΉλ³„μ‹œλŠ” 'μ„œμšΈνŠΉλ³„μ‹œμ— μ†ν•œ λŒ€ν•œλ―Όκ΅­ 제2λ„μ‹œμΈ μ„œμšΈμ‹œλ₯Ό μ€‘μ‹¬μœΌλ‘œ ν•œ κ΄‘μ—­μžμΉ˜λ‹¨μ²΄'이닀. 역사 1950λ…„ 8μ›” 1일에 μ„œμšΈμ‹œκ°€ λŒ€ν•œλ―Όκ΅­ 제2λ„μ‹œλ‘œ μŠΉκ²©ν•˜λ©΄μ„œ, μ„œμšΈνŠΉλ³„μ‹œ(μ„œμšΈ)와 경기도(μ„œμšΈ)의 경계선을 λ”°λΌμ„œ μ„€μΉ˜λ˜μ—ˆλ‹€. κ·Έ ν›„, 1989λ…„ 1μ›” 1일에 μ„œμšΈνŠΉλ³„μ‹œλ‘œ νŽΈμž…λœ κ²½κΈ°λ„μ—μ„œ λΆ„λ¦¬λ˜μ—ˆλ‹€.

Prompt: "인곡지λŠ₯은" ("Artificial intelligence is"):

인곡지λŠ₯은 μ§€λŠ₯이 μ•„λ‹Œ, 인곡 μ§€λŠ₯(AI)을 μ‚¬μš©ν•˜μ—¬ μΈκ°„μ˜ λŠ₯λ ₯을 ν–₯μƒμ‹œν‚€λŠ” κΈ°μˆ μ΄λ‹€. 인곡 μ§€λŠ₯은 μΈκ°„μ˜ λŠ₯λ ₯에 λŒ€ν•œ 지식과 μ§€λŠ₯을 λͺ¨λ‘ ν¬ν•¨ν•œλ‹€. AIλŠ” 인곡지λŠ₯의 ν•˜μœ„ λΆ„μ•Όλ‘œ 간주될 수 μžˆλ‹€. 인곡지λŠ₯의 μ μš©μ€ 인간 μ§€λŠ₯의 κ°œλ…κ³Ό μœ μ‚¬ν•˜λ‹€.

Prompt: "μˆ˜ν•™μ—μ„œ ν”Όνƒ€κ³ λΌμŠ€μ˜ μ •λ¦¬λŠ”" ("In mathematics, the Pythagorean theorem is"):

μˆ˜ν•™μ—μ„œ ν”Όνƒ€κ³ λΌμŠ€μ˜ μ •λ¦¬λŠ” λ‹€μŒκ³Ό κ°™λ‹€. :\operatorname{Span}_{\mathbb R} = \frac1{2\pi i}\int_0^{\infty} \frac{\sin(\tau)}{\sin(x)} ; \mathrm{Re}\bigl(\tau,\mathrm{Can}\bigr)

Limitations

  • Small (195M) model trained on a single domain (Korean Wikipedia). It will hallucinate facts confidently.
  • Not instruction-tuned. It continues text in Wikipedia style, not as a chatbot.
  • 1,024 token context window.
  • Korean-only with limited English capability.

License

MIT
