**This model is not an official release. It is a proof of concept of the Arlow text architecture.**

The official ArlowGPT model will be vision-capable; this model demonstrates only the text backbone of ArlowGPT.

This model requires a specific fork of Transformers, because the Arlow architecture code has not yet been merged into the official library.

Special transformers fork: https://github.com/yuchenxie4645/transformers/tree/ArlowVL

```shell
git clone --branch ArlowVL --single-branch https://github.com/yuchenxie4645/transformers
cd transformers
pip install -e .
```
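After installing, one quick sanity check is whether the installed `transformers` build actually exposes the Arlow classes. A minimal sketch (the helper function name is my own, not part of the fork):

```python
import importlib.util

def has_arlow_support() -> bool:
    # True only if the installed transformers build exposes ArlowForCausalLM.
    if importlib.util.find_spec("transformers") is None:
        return False  # transformers is not installed at all
    import transformers
    return hasattr(transformers, "ArlowForCausalLM")

print(has_arlow_support())
```

If this prints `False`, the stock PyPI `transformers` is likely shadowing the editable install of the fork.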

## Example code

```python
import torch
from transformers import ArlowTokenizer, ArlowForCausalLM

model_path = "yuchenxie/ArlowGPT-4B-Foundational-Preview-V2"

tokenizer = ArlowTokenizer.from_pretrained(model_path)
model = ArlowForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

prompt = "Give it all you got"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated)
```

## Training details

### Model Architecture

| Parameter | Value |
| --- | --- |
| Architecture | ArlowText (decoder-only) |
| Parameters | ~4B |
| Hidden size | 3072 |
| Intermediate size | 8192 |
| Layers | 28 |
| Attention heads | 24 (Q) / 4 (KV, GQA) |
| Vocab size | 131,074 |
| Max position embeddings | 4,096 |
| RoPE θ | 100,000 |
| Activation | SiLU |
| Precision | bf16 |
| Attention | Flash Attention 4 |
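With grouped-query attention (24 query heads sharing 4 KV heads), the KV cache shrinks by the head ratio. A back-of-the-envelope sketch using the table's numbers (my own arithmetic, not a figure from the card):

```python
layers, hidden, q_heads, kv_heads = 28, 3072, 24, 4
head_dim = hidden // q_heads          # 128
seq_len, bytes_per_el = 4096, 2       # full context, bf16

def kv_cache_bytes(n_kv_heads):
    # K and V tensors per layer: seq_len x n_kv_heads x head_dim elements each
    return 2 * layers * n_kv_heads * head_dim * seq_len * bytes_per_el

gqa = kv_cache_bytes(kv_heads)        # 4 KV heads (this model)
mha = kv_cache_bytes(q_heads)         # 24 KV heads (hypothetical full MHA)
print(gqa // 2**20, mha // 2**20)     # 224 MiB vs 1344 MiB: 6x smaller
```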

### Dataset

| Parameter | Value |
| --- | --- |
| Dataset | yuchenxie/Arlow-Constellations |
| Config | default |
| Split | train_0 |
| Text column | text |
| Packing | Concatenate + split into 4,096-token blocks |
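The packing row describes the usual causal-LM recipe: tokenize every document, concatenate the IDs, then slice into fixed 4,096-token blocks and drop the incomplete tail. A minimal sketch (the function name is my own):

```python
def pack_blocks(tokenized_docs, block_size=4096):
    # Concatenate all token IDs across documents, then split into
    # fixed-size blocks; the incomplete trailing block is dropped.
    buf = [tok for doc in tokenized_docs for tok in doc]
    usable = len(buf) // block_size * block_size
    return [buf[i:i + block_size] for i in range(0, usable, block_size)]

# Toy demo with block_size=2 instead of 4096
print(pack_blocks([[1, 2, 3], [4, 5, 6, 7]], block_size=2))
# [[1, 2], [3, 4], [5, 6]]
```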

### Training Configuration

| Parameter | Value |
| --- | --- |
| Hardware | 1× B300 GPU server |
| Framework | DeepSpeed ZeRO Stage 3 |
| Micro batch size | 20 per GPU |
| Gradient accumulation | 32 |
| Global batch size | 20 × 32 × 1 = 640 sequences |
| Tokens per step | 640 × 4,096 = 2,621,440 tokens |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=1e-5) |
| Weight decay | 0.1 |
| Peak learning rate | 5e-5 |
| Warmup LR | 1e-6 → 5e-5 |
| Warmup steps | 953 |
| LR schedule | Linear warmup → linear decay to 0 |
| Gradient clipping | 0.25 |
| Gradient checkpointing | Enabled |
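The warmup and schedule rows define a simple piecewise-linear LR curve. A sketch of that schedule using the values above (the 9,525 total steps comes from the progress section):

```python
def lr_at(step, peak=5e-5, warmup_start=1e-6, warmup_steps=953, total_steps=9525):
    # Linear warmup from warmup_start to peak, then linear decay to 0.
    if step < warmup_steps:
        return warmup_start + (peak - warmup_start) * step / warmup_steps
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak * (1.0 - frac)

print(lr_at(0), lr_at(953), lr_at(9525))  # 1e-06 5e-05 0.0
```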

### Training Progress (Step 6,000 / 9,525)

| Metric | Value |
| --- | --- |
| Optimizer steps completed | 6,000 |
| Epochs completed | ~1.26 / 2.00 |
| Tokens trained | ~15.73B |
| Latest loss | 4.58 |
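The token count follows directly from the batch arithmetic in the configuration table (my own check of the card's figures):

```python
tokens_per_step = 20 * 32 * 1 * 4096   # micro-batch x grad-accum x GPUs x seq-len
steps_done = 6_000
tokens_trained = steps_done * tokens_per_step
print(tokens_per_step, tokens_trained)  # 2621440 15728640000 (~15.73B)
```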

### Files Produced

```
final/  (exported HF format)
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── config.json
├── tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, ...
└── generation_config.json
```