---
language:
- sv
- en
- code
license: apache-2.0
tags:
- causal-lm
- llama
- swedish
- gqa
- sungpt
- chat
- instruction-tuned
pipeline_tag: text-generation
---

# sungpt-swe-410m

A 410M-parameter instruction-tuned chat model trained from scratch on Swedish text, English web text, math, and code, then fine-tuned in two stages (chat + coding SFT).

Built with the [sungpt](https://github.com/revana/sungpt) training framework — a Llama-style architecture (RoPE + RMSNorm + SwiGLU + GQA) with weights exported directly to `LlamaForCausalLM` for zero-friction HF compatibility.

---

## Model details

| Hyperparameter      | Value                                            |
|---------------------|--------------------------------------------------|
| Architecture        | LlamaForCausalLM (RoPE + RMSNorm + SwiGLU + GQA) |
| Hidden size         | 1024                                             |
| Layers              | 24                                               |
| Attention heads     | 16                                               |
| KV heads (GQA)      | 8                                                |
| FFN intermediate    | 4096 (SwiGLU)                                    |
| Max sequence length | 4096                                             |
| Vocab size          | 32,000                                           |
| Parameters          | ~435M                                            |
| Precision           | bfloat16                                         |
| Tied embeddings     | Yes                                              |

---

## Quick start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "revana/sungpt-swe-410m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

### Chat / instruction

This model uses the **Alpaca** prompt format:

```
### Instruction:
What is machine learning?
### Response:
```

```python
messages = [
    {"role": "user", "content": "What is machine learning?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
)

# Decode only the newly generated assistant tokens
reply = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(reply)
```

### Completion (base-style)

```python
prompts = {
    "code": "def merge_sort(arr):\n    \"\"\"Sort a list using merge sort.\"\"\"\n",
    "math": "To solve the equation 2x + 5 = 13, we first subtract 5 from both sides to get",
    "english": "The transformer architecture was introduced in the paper 'Attention is All You Need' and works by",
    "swedish": "Sverige är känt för sin starka välfärdsmodell och",
}

for domain, prompt in prompts.items():
    print(f"\n--- {domain} ---")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```

**CPU / low-VRAM:**

```python
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
```

Default generation settings (`generation_config.json`): `temperature=0.8`, `top_p=0.95`, `top_k=50`, `repetition_penalty=1.1`, `max_new_tokens=512` — so a bare `model.generate(**inputs)` already samples.
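The Alpaca prompt shown above can also be assembled by hand when `apply_chat_template` is not available (e.g. with a raw tokenizer). A minimal sketch — the exact whitespace is an assumption based on the standard Alpaca template, and the tokenizer's bundled chat template remains authoritative:

```python
def build_alpaca_prompt(instruction: str) -> str:
    """Assemble an Alpaca-style prompt by hand.

    Assumes the standard Alpaca spacing (blank line between the instruction
    body and the response header); check the bundled chat template if
    generations look off.
    """
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

prompt = build_alpaca_prompt("What is machine learning?")
print(prompt)
```

Generation then proceeds exactly as in the chat example above, decoding only the tokens produced after the prompt.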
---

## Training

### Pretraining

| Property   | Value |
|------------|-------|
| Framework  | [sungpt](https://github.com/revana/sungpt) (custom, Llama-style) |
| Hardware   | 1x H200 80 GB |
| Precision  | bfloat16, gradient checkpointing, `torch.compile` |
| Optimizer  | AdamW, lr 2e-4, betas=(0.9, 0.95), cosine decay |
| Batch size | 64 sequences x 4096 tokens = ~262K tokens/step |
| Throughput | ~48K tokens/sec at plateau |

**Pretraining data mix (~1.2B tokens):**

| Dataset | Samples | Notes |
|---------|---------|-------|
| [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | 200,000 | English web |
| [codeparrot/github-code](https://huggingface.co/datasets/codeparrot/github-code) | 400,000 | Code |
| [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | 200,000 | Educational web |
| [meta-math/MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA) | 395,000 | Math reasoning |

Data was pre-tokenized into memmap shards before training for maximum GPU throughput.

### Fine-tuning (SFT — 2-stage pipeline)

**Stage 1 — Chat SFT** (teaches instruction-following format):

| Property   | Value |
|------------|-------|
| Dataset    | [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) (~52K examples) |
| Format     | Alpaca (`### Instruction / ### Response`) |
| Epochs     | 3 (~4,875 steps) |
| Batch size | 32 |
| LR         | 2e-5, cosine decay, 100 warmup steps |
| Precision  | bfloat16 |

**Stage 2 — Coding SFT** (teaches code-on-demand generation):

| Property   | Value |
|------------|-------|
| Dataset    | [theblackcat102/evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) (~111K examples) |
| Format     | Alpaca (`### Instruction / ### Response`) |
| Epochs     | 3 (~10,406 steps) |
| Batch size | 32 |
| LR         | 1e-5, cosine decay, 100 warmup steps |
| Precision  | bfloat16 |

---

## Tokenizer

Custom BPE tokenizer (32,000 vocab) trained on Swedish + English + code text.
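One rough way to gauge how well the shared vocabulary covers each domain is token fertility (tokens per character; lower means denser encoding). A minimal sketch, assuming only the standard callable tokenizer interface — the helper name and sample texts are illustrative, not part of the model card:

```python
def fertility(tokenizer, text: str) -> float:
    """Tokens per character: a rough measure of vocabulary coverage.

    Lower values mean the tokenizer encodes the text more densely.
    """
    ids = tokenizer(text)["input_ids"]
    return len(ids) / len(text)

# With the real tokenizer (downloads the model repo, so shown commented out):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("revana/sungpt-swe-410m")
# for text in ("Maskininlärning är spännande.",
#              "Machine learning is exciting.",
#              "def add(a, b):\n    return a + b"):
#     print(f"{fertility(tok, text):.2f}  {text!r}")
```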
Special tokens: `[BOS]` (id 2), `[EOS]` (id 3), `[PAD]` (id 1).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("revana/sungpt-swe-410m")
tokens = tokenizer("Hej världen!", return_tensors="pt")
```

---

## Limitations

- **Swedish skew** — stronger at Swedish and code than general English.
- **No RLHF / safety alignment** — outputs may be biased or inappropriate; use with care in production.
- **410M parameters** — capacity is limited; expect repetition on long contexts without `repetition_penalty`.

---

## License

Apache 2.0 — see [LICENSE](LICENSE).