**This model is not an official release. It is a proof of concept of the Arlow text architecture.**

The official ArlowGPT model will be vision-capable; this model demonstrates only the text backbone of ArlowGPT.

This model requires a specific fork of Transformers, because the Arlow architecture code has not yet been merged into the official library.

Special transformers fork: https://github.com/yuchenxie4645/transformers/tree/ArlowVL

```shell
git clone --branch ArlowVL --single-branch https://github.com/yuchenxie4645/transformers
cd transformers
pip install -e .
```
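After installing, one quick sanity check is whether the installed `transformers` build actually exposes the Arlow classes. A minimal sketch (the helper function name is my own, not part of the fork):

```python
import importlib.util

def has_arlow_support() -> bool:
    # True only if the installed transformers build exposes ArlowForCausalLM.
    if importlib.util.find_spec("transformers") is None:
        return False  # transformers is not installed at all
    import transformers
    return hasattr(transformers, "ArlowForCausalLM")

print(has_arlow_support())
```

If this prints `False`, the stock PyPI `transformers` is likely shadowing the editable install of the fork.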

## Example code

```python
import torch
from transformers import ArlowTokenizer, ArlowForCausalLM

model_path = "yuchenxie/ArlowGPT-4B-Foundational-Preview-V2"

tokenizer = ArlowTokenizer.from_pretrained(model_path)
model = ArlowForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

prompt = "Give it all you got"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated)
```

## Training details

### Model Architecture

| Parameter | Value |
| --- | --- |
| Architecture | ArlowText (decoder-only) |
| Parameters | ~4B |
| Hidden size | 3072 |
| Intermediate size | 8192 |
| Layers | 28 |
| Attention heads | 24 (Q) / 4 (KV, GQA) |
| Vocab size | 131,074 |
| Max position embeddings | 4,096 |
| RoPE θ | 100,000 |
| Activation | SiLU |
| Precision | bf16 |
| Attention | Flash Attention 4 |
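With grouped-query attention (24 query heads sharing 4 KV heads), the KV cache shrinks by the head ratio. A back-of-the-envelope sketch using the table's numbers (my own arithmetic, not a figure from the card):

```python
layers, hidden, q_heads, kv_heads = 28, 3072, 24, 4
head_dim = hidden // q_heads          # 128
seq_len, bytes_per_el = 4096, 2       # full context, bf16

def kv_cache_bytes(n_kv_heads):
    # K and V tensors per layer: seq_len x n_kv_heads x head_dim elements each
    return 2 * layers * n_kv_heads * head_dim * seq_len * bytes_per_el

gqa = kv_cache_bytes(kv_heads)        # 4 KV heads (this model)
mha = kv_cache_bytes(q_heads)         # 24 KV heads (hypothetical full MHA)
print(gqa // 2**20, mha // 2**20)     # 224 MiB vs 1344 MiB: 6x smaller
```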

### Dataset

| Parameter | Value |
| --- | --- |
| Dataset | yuchenxie/Arlow-Constellations |
| Config | default |
| Split | train_0 |
| Text column | text |
| Packing | Concatenate + split into 4,096-token blocks |
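The packing row describes the usual causal-LM recipe: tokenize every document, concatenate the IDs, then slice into fixed 4,096-token blocks and drop the incomplete tail. A minimal sketch (the function name is my own):

```python
def pack_blocks(tokenized_docs, block_size=4096):
    # Concatenate all token IDs across documents, then split into
    # fixed-size blocks; the incomplete trailing block is dropped.
    buf = [tok for doc in tokenized_docs for tok in doc]
    usable = len(buf) // block_size * block_size
    return [buf[i:i + block_size] for i in range(0, usable, block_size)]

# Toy demo with block_size=2 instead of 4096
print(pack_blocks([[1, 2, 3], [4, 5, 6, 7]], block_size=2))
# [[1, 2], [3, 4], [5, 6]]
```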

### Training Configuration

| Parameter | Value |
| --- | --- |
| Hardware | 1× B300 GPU server |
| Framework | DeepSpeed ZeRO Stage 3 |
| Micro batch size | 20 per GPU |
| Gradient accumulation | 32 |
| Global batch size | 20 × 32 × 1 = 640 sequences |
| Tokens per step | 640 × 4,096 = 2,621,440 tokens |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=1e-5) |
| Weight decay | 0.1 |
| Peak learning rate | 5e-5 |
| Warmup LR | 1e-6 → 5e-5 |
| Warmup steps | 953 |
| LR schedule | Linear warmup → linear decay to 0 |
| Gradient clipping | 0.25 |
| Gradient checkpointing | Enabled |
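The warmup and schedule rows define a simple piecewise-linear LR curve. A sketch of that schedule using the values above (the 9,525 total steps comes from the progress section):

```python
def lr_at(step, peak=5e-5, warmup_start=1e-6, warmup_steps=953, total_steps=9525):
    # Linear warmup from warmup_start to peak, then linear decay to 0.
    if step < warmup_steps:
        return warmup_start + (peak - warmup_start) * step / warmup_steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak * (1.0 - frac)

print(lr_at(0), lr_at(953), lr_at(9525))  # 1e-06 5e-05 0.0
```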

### Training Progress (Step 6,000 / 9,525)

| Metric | Value |
| --- | --- |
| Optimizer steps completed | 6,000 |
| Epochs completed | ~1.26 / 2.00 |
| Tokens trained | ~15.73B |
| Latest loss | 4.58 |
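The token count follows directly from the batch arithmetic in the configuration table (my own check of the card's figures):

```python
tokens_per_step = 20 * 32 * 1 * 4096   # micro-batch x grad-accum x GPUs x seq-len
steps_done = 6_000
tokens_trained = steps_done * tokens_per_step
print(tokens_per_step, tokens_trained)  # 2621440 15728640000 (~15.73B)
```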

### Files Produced

```
final/  (exported HF format)
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── config.json
├── tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, ...
└── generation_config.json
```