---
language:
- sv
- en
- code
license: apache-2.0
tags:
- causal-lm
- llama
- swedish
- gqa
- sungpt
- chat
- instruction-tuned
pipeline_tag: text-generation
---

# sungpt-swe-410m

A 410M-parameter instruction-tuned chat model trained from scratch on Swedish text, English web text, math, and code, then fine-tuned in two stages (chat + coding SFT).

Built with the [sungpt](https://github.com/revana/sungpt) training framework — a Llama-style architecture (RoPE + RMSNorm + SwiGLU + GQA) with weights exported directly to `LlamaForCausalLM` for zero-friction HF compatibility.

---

## Model details

| Hyperparameter      | Value                                            |
|---------------------|--------------------------------------------------|
| Architecture        | LlamaForCausalLM (RoPE + RMSNorm + SwiGLU + GQA) |
| Hidden size         | 1024                                             |
| Layers              | 24                                               |
| Attention heads     | 16                                               |
| KV heads (GQA)      | 8                                                |
| FFN intermediate    | 4096 (SwiGLU)                                    |
| Max sequence length | 4096                                             |
| Vocab size          | 32,000                                           |
| Parameters          | ~435M                                            |
| Precision           | bfloat16                                         |
| Tied embeddings     | Yes                                              |

---

## Quick start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "revana/sungpt-swe-410m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

### Chat / instruction

This model uses the **Alpaca** prompt format:

```
### Instruction:
What is machine learning?
### Response:
```

```python
messages = [
    {"role": "user", "content": "What is machine learning?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
)

# Decode only the newly generated assistant tokens
reply = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(reply)
```

### Completion (base-style)

```python
prompts = {
    "code": "def merge_sort(arr):\n    \"\"\"Sort a list using merge sort.\"\"\"\n",
    "math": "To solve the equation 2x + 5 = 13, we first subtract 5 from both sides to get",
    "english": "The transformer architecture was introduced in the paper 'Attention is All You Need' and works by",
    "swedish": "Sverige är känt för sin starka välfärdsmodell och",
}

for domain, prompt in prompts.items():
    print(f"\n--- {domain} ---")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        repetition_penalty=1.1,
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```

**CPU / low-VRAM:**

```python
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
```

Default generation settings (`generation_config.json`): `temperature=0.8`, `top_p=0.95`, `top_k=50`, `repetition_penalty=1.1`, `max_new_tokens=512` — so a bare `model.generate(**inputs)` already samples.
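The Alpaca prompt shown above can also be assembled by hand when `apply_chat_template` is not available (e.g. with a raw tokenizer). A minimal sketch — the exact whitespace is an assumption based on the standard Alpaca template, and the tokenizer's bundled chat template remains authoritative:

```python
def build_alpaca_prompt(instruction: str) -> str:
    """Assemble an Alpaca-style prompt by hand.

    Assumes the standard Alpaca spacing (blank line between the instruction
    body and the response header); check the bundled chat template if
    generations look off.
    """
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

prompt = build_alpaca_prompt("What is machine learning?")
print(prompt)
```

Generation then proceeds exactly as in the chat example above, decoding only the tokens produced after the prompt.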
---

## Training

### Pretraining

| Property   | Value |
|------------|-------|
| Framework  | [sungpt](https://github.com/revana/sungpt) (custom, Llama-style) |
| Hardware   | 1x H200 80 GB |
| Precision  | bfloat16, gradient checkpointing, `torch.compile` |
| Optimizer  | AdamW, lr 2e-4, betas=(0.9, 0.95), cosine decay |
| Batch size | 64 sequences x 4096 tokens = ~262K tokens/step |
| Throughput | ~48K tokens/sec at plateau |

**Pretraining data mix (~1.2B tokens):**

| Dataset | Samples | Notes |
|---------|---------|-------|
| [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) | 200,000 | English web |
| [codeparrot/github-code](https://huggingface.co/datasets/codeparrot/github-code) | 400,000 | Code |
| [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) | 200,000 | Educational web |
| [meta-math/MetaMathQA](https://huggingface.co/datasets/meta-math/MetaMathQA) | 395,000 | Math reasoning |

Data was pre-tokenized into memmap shards before training for maximum GPU throughput.

### Fine-tuning (SFT — 2-stage pipeline)

**Stage 1 — Chat SFT** (teaches instruction-following format):

| Property   | Value |
|------------|-------|
| Dataset    | [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) (~52K examples) |
| Format     | Alpaca (`### Instruction / ### Response`) |
| Epochs     | 3 (~4,875 steps) |
| Batch size | 32 |
| LR         | 2e-5, cosine decay, 100 warmup steps |
| Precision  | bfloat16 |

**Stage 2 — Coding SFT** (teaches code-on-demand generation):

| Property   | Value |
|------------|-------|
| Dataset    | [theblackcat102/evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) (~111K examples) |
| Format     | Alpaca (`### Instruction / ### Response`) |
| Epochs     | 3 (~10,406 steps) |
| Batch size | 32 |
| LR         | 1e-5, cosine decay, 100 warmup steps |
| Precision  | bfloat16 |

---

## Tokenizer

Custom BPE tokenizer (32,000 vocab) trained on Swedish + English + code text.
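One rough way to gauge how well the shared vocabulary covers each domain is token fertility (tokens per character; lower means denser encoding). A minimal sketch, assuming only the standard callable tokenizer interface — the helper name and sample texts are illustrative, not part of the model card:

```python
def fertility(tokenizer, text: str) -> float:
    """Tokens per character: a rough measure of vocabulary coverage.

    Lower values mean the tokenizer encodes the text more densely.
    """
    ids = tokenizer(text)["input_ids"]
    return len(ids) / len(text)

# With the real tokenizer (downloads the model repo, so shown commented out):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("revana/sungpt-swe-410m")
# for text in ("Maskininlärning är spännande.",
#              "Machine learning is exciting.",
#              "def add(a, b):\n    return a + b"):
#     print(f"{fertility(tok, text):.2f}  {text!r}")
```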
Special tokens: `[BOS]` (id 2), `[EOS]` (id 3), `[PAD]` (id 1).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("revana/sungpt-swe-410m")
tokens = tokenizer("Hej världen!", return_tensors="pt")
```

---

## Limitations

- **Swedish skew** — stronger at Swedish and code than general English.
- **No RLHF / safety alignment** — outputs may be biased or inappropriate; use with care in production.
- **410M parameters** — capacity is limited; expect repetition on long contexts without `repetition_penalty`.

---

## License

Apache 2.0 — see [LICENSE](LICENSE).