THIS MODEL IS NOT OFFICIAL BUT RATHER A PROOF OF CONCEPT OF THE ARLOW TEXT ARCHITECTURE

The official ArlowGPT model will be vision-capable. This model is a proof of concept of the text backbone of ArlowGPT.

This model requires a specific transformers fork to run, as the architecture code has not yet been merged into official transformers.

Special transformers fork: https://github.com/yuchenxie4645/transformers/tree/ArlowVL (the ArlowVL branch is the latest version)

```shell
git clone --branch ArlowVL --single-branch https://github.com/yuchenxie4645/transformers/  # one-line clone
cd transformers && pip install -e .
```

Example code:

```python
import torch
from transformers import ArlowTokenizer, ArlowForCausalLM

model_path = "yuchenxie/ArlowGPT-4B-Foundational-Preview-V1"

tokenizer = ArlowTokenizer.from_pretrained(model_path)
model = ArlowForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

prompt = "Give it all you got "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated)
```

Training details

Model Architecture

| Parameter | Value |
|---|---|
| Architecture | ArlowText (decoder-only) |
| Parameters | ~3.53B |
| Hidden size | 3072 |
| Intermediate size | 8192 |
| Layers | 28 |
| Attention heads | 24 (Q) / 4 (KV, GQA) |
| Vocab size | 131,072 |
| Max position embeddings | 4,096 |
| RoPE θ | 100,000 |
| Activation | SiLU |
| Precision | bf16 |
| Attention | Flash Attention 2 |
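A back-of-the-envelope check shows the hyperparameters above are consistent with the ~3.53B parameter count. This sketch assumes a gated SiLU MLP (gate/up/down projections), untied input/output embeddings, and no attention biases; these are assumptions about the ArlowText implementation, not confirmed details.

```python
# Rough parameter count from the architecture table above.
hidden, inter, layers, vocab = 3072, 8192, 28, 131_072
q_heads, kv_heads = 24, 4
head_dim = hidden // q_heads      # 128
kv_dim = kv_heads * head_dim      # 512 (GQA: K/V projections are 6x smaller than Q)

attn = 2 * hidden * hidden        # Q and O projections
attn += 2 * hidden * kv_dim       # K and V projections
mlp = 3 * hidden * inter          # gate, up, down (assumed gated MLP)
per_layer = attn + mlp

embeddings = 2 * vocab * hidden   # input embedding + LM head (assumed untied)
total = layers * per_layer + embeddings
print(f"{total / 1e9:.2f}B parameters")  # lands close to the listed ~3.53B
```

Norm weights are omitted as negligible; the estimate only matches the listed figure if the output embedding is untied, which the large 131k vocabulary makes plausible.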

Dataset

| Parameter | Value |
|---|---|
| Dataset | CohereLabs/aya_collection_language_split |
| Subset | english |
| Split | train |
| Text column | inputs |
| Packing | Concatenate + split into 4096-token blocks |
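The packing scheme above can be sketched as follows: tokenized documents are joined into one stream and cut into fixed-length blocks, with the ragged tail discarded. This is an illustrative minimal version, not the training code.

```python
def pack(token_lists, block_size=4096):
    """Concatenate tokenized docs and split into fixed-size blocks."""
    stream = [t for doc in token_lists for t in doc]  # concatenate
    n_blocks = len(stream) // block_size              # drop the ragged tail
    return [stream[i * block_size:(i + 1) * block_size]
            for i in range(n_blocks)]

# Toy example with block_size=4 for readability:
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(pack(docs, block_size=4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Packing this way means blocks can span document boundaries; whether the real pipeline inserts separator tokens between documents is not stated here.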

Training Configuration

| Parameter | Value |
|---|---|
| Hardware | 8× NVIDIA A6000 48GB (PCIe 3.0 interconnect) |
| Framework | DeepSpeed ZeRO Stage 2 |
| Micro batch size | 4 per GPU |
| Gradient accumulation | 32 |
| Global batch size | 4 × 32 × 8 = 1,024 sequences |
| Tokens per step | 1,024 × 4,096 = 4.19M tokens |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=1e-8) |
| Weight decay | 0.1 |
| Peak learning rate | 3e-4 |
| Warmup LR | 3e-5 → 3e-4 |
| Warmup steps | 1,000 |
| LR schedule | Linear warmup → linear decay to 0 |
| Gradient clipping | 1.0 |
| Gradient checkpointing | Enabled |
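The batch and token arithmetic in the table above works out as:

```python
# Effective batch size and tokens per optimizer step.
micro_batch, grad_accum, gpus, seq_len = 4, 32, 8, 4096

global_batch = micro_batch * grad_accum * gpus
tokens_per_step = global_batch * seq_len
print(global_batch)                      # 1024 sequences
print(f"{tokens_per_step / 1e6:.2f}M")   # 4.19M tokens per optimizer step
```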

Training Progress (Checkpoint-454)

| Metric | Value |
|---|---|
| Optimizer steps completed | 454 |
| Epochs completed | 1 (full pass over token cache) |
| Tokens trained | ~1.9B |
| Final loss | 2.58–2.82 |
| Final perplexity | ~14–17 |
| Throughput | ~13,700 tokens/sec |
| Peak GPU memory | ~33.7 GB per GPU |
| Wall time | ~38.5 hours |
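The reported metrics cross-check against each other. Token count follows from steps × tokens per step, perplexity is exp(loss), and throughput is tokens over wall time:

```python
import math

steps = 454
tokens_per_step = 1024 * 4096            # from the training configuration
tokens = steps * tokens_per_step
print(f"{tokens / 1e9:.2f}B tokens")     # ~1.9B, matching the table

# Perplexity = exp(loss): the 2.58-2.82 loss range maps to roughly 13-17.
print(round(math.exp(2.58), 1), round(math.exp(2.82), 1))

# Throughput implied by token count and ~38.5 h wall time.
print(f"{tokens / (38.5 * 3600):.0f} tokens/sec")  # ~13.7k, matching the table
```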

Files Produced

final/  (exported HF format)
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── config.json
├── tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, ...
└── generation_config.json