THIS MODEL IS NOT OFFICIAL BUT RATHER A PROOF OF CONCEPT OF THE ARLOW TEXT ARCHITECTURE

The official ArlowGPT model will be vision-capable. This model is a proof of concept of the text backbone of ArlowGPT.

This model requires a specific transformers fork to run, as the architecture code has not yet been merged into official transformers.

Special transformers fork: https://github.com/yuchenxie4645/transformers/tree/ArlowVL (the ArlowVL branch is the latest version)

```shell
git clone --branch ArlowVL --single-branch https://github.com/yuchenxie4645/transformers/  # one-line clone
cd transformers && pip install -e .
```

Example code:

```python
import torch
from transformers import ArlowTokenizer, ArlowForCausalLM

model_path = "yuchenxie/ArlowGPT-4B-Foundational-Preview-V1"

tokenizer = ArlowTokenizer.from_pretrained(model_path)
model = ArlowForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

prompt = "Give it all you got "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated)
```

Training details

Model Architecture

| Parameter | Value |
|---|---|
| Architecture | ArlowText (decoder-only) |
| Parameters | ~3.53B |
| Hidden size | 3072 |
| Intermediate size | 8192 |
| Layers | 28 |
| Attention heads | 24 (Q) / 4 (KV, GQA) |
| Vocab size | 131,072 |
| Max position embeddings | 4,096 |
| RoPE θ | 100,000 |
| Activation | SiLU |
| Precision | bf16 |
| Attention | Flash Attention 2 |
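A back-of-the-envelope check shows the hyperparameters above are consistent with the ~3.53B parameter count. This sketch assumes a gated SiLU MLP (gate/up/down projections), untied input/output embeddings, and no attention biases; these are assumptions about the ArlowText implementation, not confirmed details.

```python
# Rough parameter count from the architecture table above.
hidden, inter, layers, vocab = 3072, 8192, 28, 131_072
q_heads, kv_heads = 24, 4
head_dim = hidden // q_heads      # 128
kv_dim = kv_heads * head_dim      # 512 (GQA: K/V projections are 6x smaller than Q)

attn = 2 * hidden * hidden        # Q and O projections
attn += 2 * hidden * kv_dim       # K and V projections
mlp = 3 * hidden * inter          # gate, up, down (assumed gated MLP)
per_layer = attn + mlp

embeddings = 2 * vocab * hidden   # input embedding + LM head (assumed untied)
total = layers * per_layer + embeddings
print(f"{total / 1e9:.2f}B parameters")  # lands close to the listed ~3.53B
```

Norm weights are omitted as negligible; the estimate only matches the listed figure if the output embedding is untied, which the large 131k vocabulary makes plausible.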

Dataset

| Parameter | Value |
|---|---|
| Dataset | CohereLabs/aya_collection_language_split |
| Subset | english |
| Split | train |
| Text column | inputs |
| Packing | Concatenate + split into 4096-token blocks |
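The packing scheme above can be sketched as follows: tokenized documents are joined into one stream and cut into fixed-length blocks, with the ragged tail discarded. This is an illustrative minimal version, not the training code.

```python
def pack(token_lists, block_size=4096):
    """Concatenate tokenized docs and split into fixed-size blocks."""
    stream = [t for doc in token_lists for t in doc]  # concatenate
    n_blocks = len(stream) // block_size              # drop the ragged tail
    return [stream[i * block_size:(i + 1) * block_size]
            for i in range(n_blocks)]

# Toy example with block_size=4 for readability:
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pack(docs, block_size=4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Packing this way means blocks can span document boundaries; whether the real pipeline inserts separator tokens between documents is not stated here.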

Training Configuration

| Parameter | Value |
|---|---|
| Hardware | 8× NVIDIA A6000 48GB (PCIe 3.0 interconnect) |
| Framework | DeepSpeed ZeRO Stage 2 |
| Micro batch size | 4 per GPU |
| Gradient accumulation | 32 |
| Global batch size | 4 × 32 × 8 = 1,024 sequences |
| Tokens per step | 1,024 × 4,096 = 4.19M tokens |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=1e-8) |
| Weight decay | 0.1 |
| Peak learning rate | 3e-4 |
| Warmup LR | 3e-5 → 3e-4 |
| Warmup steps | 1,000 |
| LR schedule | Linear warmup → linear decay to 0 |
| Gradient clipping | 1.0 |
| Gradient checkpointing | Enabled |
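The batch and token arithmetic in the table above works out as:

```python
# Effective batch size and tokens per optimizer step.
micro_batch, grad_accum, gpus, seq_len = 4, 32, 8, 4096

global_batch = micro_batch * grad_accum * gpus
tokens_per_step = global_batch * seq_len
print(global_batch)                      # 1024 sequences
print(f"{tokens_per_step / 1e6:.2f}M")   # 4.19M tokens per optimizer step
```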

Training Progress (Checkpoint-454)

| Metric | Value |
|---|---|
| Optimizer steps completed | 454 |
| Epochs completed | 1 (full pass over token cache) |
| Tokens trained | ~1.9B |
| Final loss | 2.58–2.82 |
| Final perplexity | ~14–17 |
| Throughput | ~13,700 tokens/sec |
| Peak GPU memory | ~33.7 GB per GPU |
| Wall time | ~38.5 hours |
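The reported metrics cross-check against each other. Token count follows from steps × tokens per step, perplexity is exp(loss), and throughput is tokens over wall time:

```python
import math

steps = 454
tokens_per_step = 1024 * 4096            # from the training configuration
tokens = steps * tokens_per_step
print(f"{tokens / 1e9:.2f}B tokens")     # ~1.9B, matching the table

# Perplexity = exp(loss): the 2.58-2.82 loss range maps to roughly 13-17.
print(round(math.exp(2.58), 1), round(math.exp(2.82), 1))

# Throughput implied by token count and ~38.5 h wall time.
print(f"{tokens / (38.5 * 3600):.0f} tokens/sec")  # ~13.7k, matching the table
```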

Files Produced

final/  (exported HF format)
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── config.json
├── tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, ...
└── generation_config.json