korean-wiki-lm-195m

A 195M-parameter decoder-only Transformer language model trained entirely from scratch on Korean Wikipedia (kowiki). Both the byte-level BPE tokenizer and the model are implemented from scratch in plain Python/PyTorch, with no external tokenization or modeling libraries.

Model Details

Parameters: 194,482,944 (unique, with weight tying)
Architecture: LLaMA-style decoder-only Transformer
Vocab size: 32,000 (byte-level BPE)
Context length: 1,024 tokens
Hidden dim: 768
Layers: 24
Attention heads: 12
FFN dim: 2,048 (SwiGLU)
Normalization: RMSNorm (pre-norm)
Position encoding: Rotary Position Embeddings (RoPE)
Weight tying: embedding ↔ LM head
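
The headline parameter count can be reproduced from the table above. A minimal sanity check, assuming no bias terms, a parameter-free RoPE, and the tied LM head counted once:

```python
# Reproduce the 194,482,944 parameter count from the config above.
# Assumptions: no biases, RoPE adds no parameters, tied LM head not double-counted.
vocab, d, layers, ffn = 32_000, 768, 24, 2_048

embed = vocab * d                       # token embedding (shared with LM head)
attn = 4 * d * d                        # Q, K, V, O projections
swiglu = 3 * d * ffn                    # gate, up, and down projections
norms = 2 * d                           # two RMSNorm scales per block
per_layer = attn + swiglu + norms

total = embed + layers * per_layer + d  # + final RMSNorm
print(total)  # 194482944
```

The exact match suggests the count excludes the LM head (tied) and that attention and FFN layers carry no biases.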

Training

Data: Korean Wikipedia dump (March 2026), 502,127 articles
Clean text: ~1.12 GB across 12 shards
Tokens: 230,722,476
Epochs: 5 (~1.15B tokens processed)
Optimizer: AdamW (lr=3e-4, cosine decay to 3e-5, warmup 2,000 steps)
Batch size: 128 sequences (16 micro-batch × 8 accumulation steps)
Precision: bf16 mixed precision
Hardware: single NVIDIA RTX 4080 (16 GB VRAM)
Training time: ~12 hours
Final loss: ~2.97
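
These numbers are internally consistent: 128 sequences × 1,024 tokens is 131,072 tokens per optimizer step, so five epochs over ~230.7M tokens works out to roughly 8,800 steps, matching the final logged step in the loss curve. A quick check:

```python
# Back-of-envelope step count from the training config above.
tokens_per_step = 128 * 1024            # batch size x context length
tokens = 230_722_476                    # tokens per epoch
steps_per_epoch = tokens // tokens_per_step
total_steps = 5 * steps_per_epoch
print(steps_per_epoch, total_steps)     # 1760 8800
```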

Loss Curve

Step     0 (epoch 0): 10.53
Step   500 (epoch 0):  6.71
Step  1000 (epoch 0):  5.48
Step  2000 (epoch 1):  4.08
Step  3000 (epoch 1):  3.67
Step  4000 (epoch 2):  3.41
Step  5000 (epoch 2):  3.31
Step  6000 (epoch 3):  3.12
Step  7000 (epoch 3):  3.10
Step  8000 (epoch 4):  2.99
Step  8800 (epoch 4):  2.97
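
Two sanity checks on the curve: the step-0 loss should sit near ln(vocab) ≈ 10.37 for a near-uniform prediction over 32,000 tokens (the observed 10.53 is close), and the final loss of ~2.97 corresponds to a perplexity of about 19.5:

```python
import math

vocab = 32_000
print(round(math.log(vocab), 2))   # 10.37 -- near the 10.53 observed at step 0
print(round(math.exp(2.97), 1))    # 19.5  -- perplexity at the final loss
```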

Tokenizer

Byte-level BPE tokenizer trained from scratch on 100 MB of Korean Wikipedia text. Starts from 256 byte-level tokens + 3 special tokens, learns 31,741 merges to reach a vocabulary of 32,000 tokens.
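
The merge-learning loop is the standard greedy BPE procedure: repeatedly count adjacent token pairs and merge the most frequent one into a new token. A toy sketch (ignoring the special tokens and the pre-tokenization split that the real trainer applies first):

```python
from collections import Counter

def learn_merges(text, num_merges):
    """Toy byte-level BPE training loop (a sketch, not the repo's trainer)."""
    seq = list(text.encode("utf-8"))  # initial token ids are raw bytes 0..255
    merges = []
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, _count = pairs.most_common(1)[0]  # most frequent adjacent pair
        merges.append((pair, next_id))
        out, i = [], 0                          # replace every occurrence
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_id += 1
    return merges, seq
```

Running `learn_merges("aaab", 1)` merges the most frequent pair `(97, 97)` into token 256, leaving the sequence `[256, 97, 98]`.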

Pre-tokenization regex (Korean-focused):

 ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+

Usage

Standalone PyTorch model; no Hugging Face Transformers dependency is needed:

import torch
from config import Config
from model import KoGPT
from tokenizer import BPETokenizer
from generate import generate

cfg = Config()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the trained BPE tokenizer files from the repo root
tokenizer = BPETokenizer(cfg)
tokenizer.load(".")

# Restore model weights from the training checkpoint
model = KoGPT(cfg).to(device)
ckpt = torch.load("model.pt", map_location=device, weights_only=False)
model.load_state_dict(ckpt["model"])
model.eval()

output = generate(
    model, tokenizer,
    prompt="ν•œκ΅­μ˜ μ—­μ‚¬λŠ”",
    max_tokens=200,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    device=device,
)
print(output)
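
The `temperature`/`top_k`/`top_p` knobs compose in the usual way: scale logits by 1/temperature, keep the k most probable tokens, then keep the smallest prefix whose cumulative probability reaches p, renormalize, and sample. A dependency-free sketch of that filter (not the repo's actual `generate` implementation, which operates on torch tensors):

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=50, top_p=0.9, rng=random):
    """Temperature + top-k + top-p (nucleus) sampling over raw logits."""
    scaled = [x / temperature for x in logits]      # temperature scaling
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]        # numerically stable softmax
    z = sum(exps)
    probs = sorted(((i, e / z) for i, e in enumerate(exps)),
                   key=lambda p: p[1], reverse=True)
    probs = probs[:top_k]                           # top-k filter
    kept, cum = [], 0.0
    for i, p in probs:                              # top-p: smallest prefix >= p
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    z = sum(p for _, p in kept)                     # renormalize and sample
    r = rng.random() * z
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]

print(sample_next([0.0, 0.0, 10.0, 0.0]))  # 2 -- the dominant logit always wins
```

With a logit gap this large, the nucleus collapses to a single token, so the result is deterministic regardless of the RNG.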

Example Outputs

Outputs generated with temperature=0.8, top_k=50, top_p=0.9, seed=42. Prompt: "μ„œμšΈνŠΉλ³„μ‹œλŠ”" ("Seoul is"):

μ„œμšΈνŠΉλ³„μ‹œλŠ” 'μ„œμšΈνŠΉλ³„μ‹œμ— μ†ν•œ λŒ€ν•œλ―Όκ΅­ 제2λ„μ‹œμΈ μ„œμšΈμ‹œλ₯Ό μ€‘μ‹¬μœΌλ‘œ ν•œ κ΄‘μ—­μžμΉ˜λ‹¨μ²΄'이닀. 역사 1950λ…„ 8μ›” 1일에 μ„œμšΈμ‹œκ°€ λŒ€ν•œλ―Όκ΅­ 제2λ„μ‹œλ‘œ μŠΉκ²©ν•˜λ©΄μ„œ, μ„œμšΈνŠΉλ³„μ‹œ(μ„œμšΈ)와 경기도(μ„œμšΈ)의 경계선을 λ”°λΌμ„œ μ„€μΉ˜λ˜μ—ˆλ‹€. κ·Έ ν›„, 1989λ…„ 1μ›” 1일에 μ„œμšΈνŠΉλ³„μ‹œλ‘œ νŽΈμž…λœ κ²½κΈ°λ„μ—μ„œ λΆ„λ¦¬λ˜μ—ˆλ‹€.

Prompt: "인곡지λŠ₯은" ("Artificial intelligence is"):

인곡지λŠ₯은 μ§€λŠ₯이 μ•„λ‹Œ, 인곡 μ§€λŠ₯(AI)을 μ‚¬μš©ν•˜μ—¬ μΈκ°„μ˜ λŠ₯λ ₯을 ν–₯μƒμ‹œν‚€λŠ” κΈ°μˆ μ΄λ‹€. 인곡 μ§€λŠ₯은 μΈκ°„μ˜ λŠ₯λ ₯에 λŒ€ν•œ 지식과 μ§€λŠ₯을 λͺ¨λ‘ ν¬ν•¨ν•œλ‹€. AIλŠ” 인곡지λŠ₯의 ν•˜μœ„ λΆ„μ•Όλ‘œ 간주될 수 μžˆλ‹€. 인곡지λŠ₯의 μ μš©μ€ 인간 μ§€λŠ₯의 κ°œλ…κ³Ό μœ μ‚¬ν•˜λ‹€.

Prompt: "μˆ˜ν•™μ—μ„œ ν”Όνƒ€κ³ λΌμŠ€μ˜ μ •λ¦¬λŠ”" ("In mathematics, the Pythagorean theorem is"):

μˆ˜ν•™μ—μ„œ ν”Όνƒ€κ³ λΌμŠ€μ˜ μ •λ¦¬λŠ” λ‹€μŒκ³Ό κ°™λ‹€. :\operatorname{Span}_{\mathbb R} = \frac1{2\pi i}\int_0^{\infty} \frac{\sin(\tau)}{\sin(x)} ; \mathrm{Re}\bigl(\tau,\mathrm{Can}\bigr)

Limitations

  • Small (195M) model trained on a single domain (Korean Wikipedia). It will hallucinate facts confidently.
  • Not instruction-tuned. It continues text in Wikipedia style, not as a chatbot.
  • 1,024 token context window.
  • Korean-only with limited English capability.

License

MIT
