ArlowGPT Foundational
The official ArlowGPT model will be vision-capable. This model is a text-only proof of concept of the ArlowGPT backbone.
Special transformers fork: https://github.com/yuchenxie4645/transformers/tree/ArlowVL (the ArlowVL branch is the latest edition)

```bash
# One-line clone of the ArlowVL branch, then editable install
git clone --branch ArlowVL --single-branch https://github.com/yuchenxie4645/transformers/
cd transformers && pip install -e .
```
```python
import torch
from transformers import ArlowTokenizer, ArlowForCausalLM

model_path = "yuchenxie/ArlowGPT-4B-Foundational-Preview-V1"

tokenizer = ArlowTokenizer.from_pretrained(model_path)
model = ArlowForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

prompt = "Give it all you got "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated)
```
**Model architecture**

| Parameter | Value |
|---|---|
| Architecture | ArlowText (decoder-only) |
| Parameters | ~3.53B |
| Hidden size | 3072 |
| Intermediate size | 8192 |
| Layers | 28 |
| Attention heads | 24 (Q) / 4 (KV, GQA) |
| Vocab size | 131,072 |
| Max position embeddings | 4,096 |
| RoPE θ | 100,000 |
| Activation | SiLU |
| Precision | bf16 |
| Attention | Flash Attention 2 |
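As a sanity check, the ~3.53B figure can be roughly reproduced from the table above. This sketch assumes untied input/output embeddings and a gated (SwiGLU-style) SiLU MLP with gate/up/down projections, and ignores norm and bias parameters; none of those details are confirmed by the card.

```python
# Rough parameter-count estimate from the architecture table.
# Assumptions (not stated in the card): untied embeddings,
# gated SiLU MLP (gate/up/down projections), no biases.
hidden = 3072
intermediate = 8192
layers = 28
q_heads, kv_heads = 24, 4
vocab = 131_072
head_dim = hidden // q_heads  # 128

attn = (hidden * q_heads * head_dim          # Q projection
        + 2 * hidden * kv_heads * head_dim   # K and V (GQA)
        + q_heads * head_dim * hidden)       # output projection
mlp = 2 * hidden * intermediate + intermediate * hidden  # gate+up, down
per_layer = attn + mlp

embeddings = vocab * hidden  # input embedding
lm_head = vocab * hidden     # untied output head (assumption)

total = layers * per_layer + embeddings + lm_head
print(f"{total / 1e9:.2f}B")  # ~3.54B, close to the reported ~3.53B
```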
**Training data**

| Parameter | Value |
|---|---|
| Dataset | CohereLabs/aya_collection_language_split |
| Subset | english |
| Split | train |
| Text column | inputs |
| Packing | Concatenate + split into 4096-token blocks |
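The packing scheme in the table (concatenate all tokenized documents, then split the stream into fixed 4,096-token blocks) can be sketched as follows. The function name and the handling of the trailing remainder are assumptions; the card does not specify how the tail is treated.

```python
def pack(token_lists, block_size=4096):
    """Concatenate tokenized documents and split into fixed-size blocks.

    Any trailing remainder shorter than block_size is dropped here,
    which is an assumption -- the card does not say how the tail is handled.
    """
    stream = [tok for doc in token_lists for tok in doc]
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size]
            for i in range(n_blocks)]

# Toy example with a block size of 4 instead of 4096:
blocks = pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```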
**Training setup**

| Parameter | Value |
|---|---|
| Hardware | 8× NVIDIA A6000 48 GB (PCIe 3.0 interconnect) |
| Framework | DeepSpeed ZeRO Stage 2 |
| Micro batch size | 4 per GPU |
| Gradient accumulation | 32 |
| Global batch size | 4 × 32 × 8 = 1,024 sequences |
| Tokens per step | 1,024 × 4,096 = 4.19M tokens |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=1e-8) |
| Weight decay | 0.1 |
| Peak learning rate | 3e-4 |
| Warmup LR | 3e-5 → 3e-4 |
| Warmup steps | 1,000 |
| LR schedule | Linear warmup → linear decay to 0 |
| Gradient clipping | 1.0 |
| Gradient checkpointing | Enabled |
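The schedule in the table (linear ramp from the 3e-5 warmup LR to the 3e-4 peak over 1,000 steps, then linear decay to 0) can be sketched as a plain function. `total_steps` is a free parameter here for illustration, not a value from the card.

```python
def lr_at(step, warmup_steps=1_000, total_steps=10_000,
          warmup_start=3e-5, peak=3e-4):
    """Linear warmup from warmup_start to peak, then linear decay to 0."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return warmup_start + frac * (peak - warmup_start)
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak * max(0.0, 1.0 - frac)

print(lr_at(0))       # 3e-05 (warmup start)
print(lr_at(1_000))   # 0.0003 (peak)
print(lr_at(10_000))  # 0.0 (fully decayed)
```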
**Training results**

| Metric | Value |
|---|---|
| Optimizer steps completed | 454 |
| Epochs completed | 1 (full pass over token cache) |
| Tokens trained | ~1.9B |
| Final loss | 2.58β2.82 |
| Final perplexity | ~14β17 |
| Throughput | ~13,700 tokens/sec |
| Peak GPU memory | ~33.7 GB per GPU |
| Wall time | ~38.5 hours |
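The figures above are internally consistent, which a quick back-of-the-envelope check confirms (perplexity = exp(loss), tokens = steps × tokens-per-step, throughput = tokens / wall time):

```python
import math

steps = 454
tokens_per_step = 1_024 * 4_096  # from the training setup table
tokens = steps * tokens_per_step
print(f"tokens trained: {tokens / 1e9:.2f}B")  # ~1.90B

ppl_lo, ppl_hi = math.exp(2.58), math.exp(2.82)
print(f"perplexity range: {ppl_lo:.1f}-{ppl_hi:.1f}")  # ~13.2-16.8

wall_seconds = 38.5 * 3600
print(f"throughput: {tokens / wall_seconds:,.0f} tokens/sec")  # ~13.7k/sec
```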
```
final/ (exported HF format)
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── config.json
├── tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, ...
└── generation_config.json
```