ArlowGPT Foundational Collection
The official ArlowGPT model will be vision-capable; this model is a proof of concept of the text backbone of ArlowGPT.

It requires a special `transformers` fork: https://github.com/yuchenxie4645/transformers/tree/ArlowVL
```shell
git clone --branch ArlowVL --single-branch https://github.com/yuchenxie4645/transformers
cd transformers
pip install -e .
```
```python
import torch
from transformers import ArlowTokenizer, ArlowForCausalLM

model_path = "yuchenxie/ArlowGPT-4B-Foundational-Preview-V2"

tokenizer = ArlowTokenizer.from_pretrained(model_path)
model = ArlowForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # weights are stored in bf16
    device_map="auto",
)
model.eval()

prompt = "Give it all you got "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated)
```
**Model architecture**

| Parameter | Value |
|---|---|
| Architecture | ArlowText (decoder-only) |
| Parameters | ~4B |
| Hidden size | 3072 |
| Intermediate size | 8192 |
| Layers | 28 |
| Attention heads | 24 (Q) / 4 (KV, GQA) |
| Vocab size | 131,074 |
| Max position embeddings | 4,096 |
| RoPE θ | 100,000 |
| Activation | SiLU |
| Precision | bf16 |
| Attention | Flash Attention 4 |
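With grouped-query attention (24 query heads sharing 4 KV heads), the KV cache is 6× smaller than full multi-head attention would be. A back-of-the-envelope calculation derived from the table above (head_dim = 3072 / 24 = 128, 2 bytes per bf16 value; these are derived figures, not measured memory):

```python
hidden_size = 3072
n_layers = 28
n_q_heads = 24
n_kv_heads = 4
head_dim = hidden_size // n_q_heads  # 128
bytes_per_value = 2                  # bf16

def kv_cache_bytes_per_token(n_heads):
    # One key and one value vector per layer per head, per token.
    return 2 * n_layers * n_heads * head_dim * bytes_per_value

gqa = kv_cache_bytes_per_token(n_kv_heads)  # 4 KV heads (GQA)
mha = kv_cache_bytes_per_token(n_q_heads)   # 24 heads (hypothetical full MHA)
print(gqa, mha, mha // gqa)  # 57344 344064 6
```

At 4,096 tokens of context that is roughly 235 MB of cache instead of 1.4 GB.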
**Training data**

| Parameter | Value |
|---|---|
| Dataset | yuchenxie/Arlow-Constellations |
| Config | default |
| Split | train_0 |
| Text column | text |
| Packing | Concatenate + split into 4,096-token blocks |
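The packing row above means documents are tokenized, concatenated into one long stream, and split into fixed 4,096-token blocks. A minimal sketch with a toy block size and fake token IDs (dropping the final partial block, the usual packing convention, which the card does not spell out):

```python
def pack(token_streams, block_size):
    """Concatenate tokenized documents and split into fixed-size blocks,
    discarding the trailing partial block."""
    stream = [tok for doc in token_streams for tok in doc]
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
blocks = pack(docs, block_size=4)  # toy size; training used 4,096
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]] — 9 and 10 fall in the dropped remainder
```

Packing keeps every position in every batch filled with real tokens, so no compute is wasted on padding.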
**Training configuration**

| Parameter | Value |
|---|---|
| Hardware | 1x B300 GPU server |
| Framework | DeepSpeed ZeRO Stage 3 |
| Micro batch size | 20 per GPU |
| Gradient accumulation | 32 |
| Global batch size | 20 × 32 × 1 = 640 sequences |
| Tokens per step | 640 × 4,096 = 2,621,440 tokens |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=1e-5) |
| Weight decay | 0.1 |
| Peak learning rate | 5e-5 |
| Warmup LR | 1e-6 → 5e-5 |
| Warmup steps | 953 |
| LR schedule | Linear warmup → linear decay to 0 |
| Gradient clipping | 0.25 |
| Gradient checkpointing | Enabled |
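Putting the schedule rows together: the learning rate ramps linearly from 1e-6 to the 5e-5 peak over 953 steps, then decays linearly to 0 over the remaining steps. A sketch of the per-step LR (the total step count used below is illustrative; the table does not state the planned total):

```python
def lr_at_step(step, total_steps, warmup_steps=953,
               warmup_start=1e-6, peak=5e-5):
    """Linear warmup from warmup_start to peak, then linear decay to 0."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return warmup_start + frac * (peak - warmup_start)
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak * (1.0 - frac)

total = 10_000  # illustrative assumption, not from the card
print(lr_at_step(0, total))      # 1e-06 (warmup start)
print(lr_at_step(953, total))    # 5e-05 (peak)
print(lr_at_step(total, total))  # 0.0 (fully decayed)
```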
**Training progress**

| Metric | Value |
|---|---|
| Optimizer steps completed | 6,000 |
| Epochs completed | ~1.26 / 2.00 |
| Tokens trained | ~15.73B |
| Latest loss | 4.58 |
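The progress figures are mutually consistent: 6,000 optimizer steps at 2,621,440 tokens per step gives the ~15.73B tokens reported above. A quick check:

```python
tokens_per_step = 640 * 4096             # global batch × sequence length
total_tokens = 6000 * tokens_per_step    # optimizer steps completed so far
print(tokens_per_step)                   # 2621440
print(round(total_tokens / 1e9, 2))      # 15.73 (billion tokens)
```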
```
final/ (exported HF format)
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── config.json
├── tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, ...
└── generation_config.json
```