base_model: adityawakharkar/AstraGPTCoder-7B
language:
  - en
license: apache-2.0
tags:
  - from-scratch
  - custom-architecture
  - custom-tokenizer
  - reasoning
  - chain-of-thought
  - think-tags
  - coding
  - fine-tuned
  - lora
  - peft
  - unsloth
  - astragpt
  - tantra-ai-labs
  - rtx-4090
pipeline_tag: text-generation
library_name: transformers
model_creator: Tantra AI Labs

AstraGPT-7B 🚀

A 7-Billion Parameter Language Model, Built From Scratch

Custom Architecture · Custom BPE Tokenizer · Reasoning Fine-Tuned on Dual RTX 4090

Built by Aditya Wakharkar | Tantra AI Labs


🧠 What is AstraGPT-7B?

AstraGPT-7B is a 7-billion parameter decoder-only language model designed for coding and chain-of-thought reasoning.

Unlike most open-source fine-tunes, every core component of AstraGPT was designed and implemented from scratch in PyTorch, including the transformer architecture, the BPE tokenizer, and the supervised fine-tuning pipeline.

The model was then fine-tuned on a reasoning dataset using LoRA on a private VPS equipped with dual NVIDIA RTX 4090 GPUs, giving it native support for <think>...</think> style reasoning output.

"Most people fine-tune models. We built one."


🏗️ Built From Scratch: Architecture Overview

Every layer of AstraGPT-7B was implemented from first principles in PyTorch. No AutoModel, no copy-paste: pure custom code.

Input Token IDs
      │
      ▼
Token Embedding  [64,000 → 4,096]
      │
      ▼  ×32 Transformer Blocks
┌─────────────────────────────────────┐
│           AstraGPT Block            │
│                                     │
│  RMSNorm (Pre-norm)                 │
│  → Grouped Query Attention (GQA)    │
│    · 32 Query Heads                 │
│    · 8 Key-Value Heads              │
│    · RoPE (θ = 1,000,000)           │
│    · KV Cache for inference         │
│  → Residual Add                     │
│                                     │
│  RMSNorm (Pre-norm)                 │
│  → SwiGLU Feed-Forward Network      │
│    · gate_proj, up_proj, down_proj  │
│    · intermediate_size = 11,008     │
│  → Residual Add                     │
└─────────────────────────────────────┘
      │
      ▼
Final RMSNorm
      │
      ▼
LM Head  [4,096 → 64,000]
      │
      ▼
Logits → Next Token

Architecture Highlights

| Component | Implementation | Why |
| --- | --- | --- |
| Grouped Query Attention (GQA) | 32Q / 8KV heads, built from scratch | 4× less KV memory vs MHA. Same used in LLaMA-3, Mistral |
| Rotary Position Embeddings (RoPE) | Full RoPE math from scratch, θ=1M | Better long-context vs learned embeddings |
| SwiGLU FFN | gate × SiLU(up) through down_proj | Outperforms GELU/ReLU on LM benchmarks |
| RMSNorm | Pre-norm, no bias, no mean subtraction | ~30% faster than LayerNorm |
| Flash Attention | PyTorch 2.0 scaled_dot_product_attention | Memory-efficient attention with O(n) space |
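
The 4× KV-memory figure for GQA follows directly from the head counts. Here is a rough sketch of the arithmetic, assuming a head dimension of 128 (4,096 / 32, which is not stated above), 16-bit cache entries, and a 2,048-token sequence. `kv_cache_bytes` is an illustrative helper, not a function from the AstraGPT codebase.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed values: head_dim = 4096 / 32 = 128, 16-bit cache, 2,048 tokens.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=2048)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=2048)

print(f"MHA: {mha / 2**20:.0f} MiB")   # 1024 MiB
print(f"GQA: {gqa / 2**20:.0f} MiB")   # 256 MiB
print(f"ratio: {mha // gqa}x")         # 4x
```

The ratio depends only on the number of KV heads (32 / 8), so it holds for any sequence length or cache precision.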

Parameter Count (~7B)

| Component | Parameters |
| --- | --- |
| Token Embedding (64K × 4096) | ~262M |
| Attention × 32 layers | ~2.15B |
| SwiGLU FFN × 32 layers | ~4.32B |
| RMSNorm × 65 | ~267K |
| LM Head | ~262M |
| Total | ~7.0B |
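
The per-component counts can be reproduced from the shapes in the architecture diagram. The sketch below assumes bias-free projections, an untied LM head, and a head dimension of 128; none of these is stated explicitly, so treat the output as an estimate.

```python
VOCAB, HIDDEN, LAYERS = 64_000, 4_096, 32
N_HEADS, N_KV_HEADS, FFN = 32, 8, 11_008
HEAD_DIM = HIDDEN // N_HEADS  # assumed: 128, not stated in the table

def attention_params(n_kv_heads):
    """Weights in q_proj, k_proj, v_proj, o_proj for one layer (no biases)."""
    q_and_o = 2 * HIDDEN * HIDDEN
    k_and_v = 2 * HIDDEN * n_kv_heads * HEAD_DIM
    return q_and_o + k_and_v

embedding = VOCAB * HIDDEN                 # 262,144,000
lm_head = VOCAB * HIDDEN                   # 262,144,000 (assumed untied)
ffn = LAYERS * 3 * HIDDEN * FFN            # 4,328,521,728
norms = (2 * LAYERS + 1) * HIDDEN          # 266,240

attn_mha = LAYERS * attention_params(N_HEADS)      # 2,147,483,648
attn_gqa = LAYERS * attention_params(N_KV_HEADS)   # 1,342,177,280

for name, attn in [("MHA", attn_mha), ("GQA", attn_gqa)]:
    total = embedding + attn + ffn + norms + lm_head
    print(f"{name}: attention {attn / 1e9:.2f}B, total {total / 1e9:.2f}B")
```

Under these assumptions, the table's ~2.15B attention figure matches the case where K and V use all 32 heads. With 8 KV heads of width 128, `k_proj` and `v_proj` shrink to 4,096 × 1,024 and the attention total comes out nearer 1.34B, for roughly 6.2B overall. The ~7.0B total therefore presumably reflects a different head dimension or counting convention.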

🔤 Custom BPE Tokenizer: From Scratch

AstraGPT uses a custom Byte Pair Encoding tokenizer built entirely from scratch, with no SentencePiece and no HuggingFace tokenizers library.

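# Built from scratch
from tokenizer import BPETokenizer

tok = BPETokenizer(vocab_size=64_000)
# Open the corpus in a context manager so the file handle is closed after training
with open("corpus.txt", encoding="utf-8") as corpus:
    tok.train(corpus, num_merges=60_000)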

Tokenizer features:

  • Byte-level base vocabulary: 256 raw bytes, handles any Unicode
  • GPT-4 style pre-tokenization regex: smart word boundary splitting
  • 64,000 vocab size: 60K BPE merges on top of byte base
  • Built-in special tokens: <think>, </think>, <|im_start|>, <|im_end|>, BOS, EOS, PAD
  • apply_chat_template(): custom chat format support
  • Save/load: JSON-serializable merge rules
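
To illustrate the byte-level merge procedure listed above, here is a minimal sketch of BPE training and encoding. It is not the AstraGPT `BPETokenizer`: the function names are hypothetical, and the pre-tokenization regex and special tokens are left out to keep the core loop visible.

```python
from collections import Counter

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))   # base vocabulary: byte values 0..255
    merges = {}                        # (left_id, right_id) -> new token id
    for new_id in range(256, 256 + num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break                      # nothing left to merge
        best = pairs.most_common(1)[0][0]
        merges[best] = new_id
        ids = merge_pair(ids, best, new_id)
    return merges

def encode(text, merges):
    """Apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():   # dicts preserve insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids

merges = train_bpe("low lower lowest", num_merges=3)
print(encode("low", merges))   # [257]
```

Because the merge table is a plain mapping of integer pairs to integer ids, it serializes to JSON once the tuple keys are converted to strings or lists.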

⚡ Training: Dual RTX 4090 on Private VPS

Fine-tuning was performed on a private Linux VPS with 2× NVIDIA RTX 4090 GPUs (48 GB VRAM in total).

Hardware Setup

| Spec | Value |
| --- | --- |
| GPUs | 2× NVIDIA RTX 4090 (24 GB VRAM each) |
| Total VRAM | 48 GB |
| CPU | High-core count server CPU |
| Infrastructure | Private VPS (bare metal) |
| OS | Ubuntu 22.04 LTS |
| CUDA | 12.x |

Training Pipeline: Also Built From Scratch

The SFT (Supervised Fine-Tuning) training loop was implemented from scratch with production-grade features:

# Full custom training loop
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    # Dual GPU via DDP
    use_bf16=True,
    grad_accumulation=8,
    learning_rate=2e-4,
    use_wandb=True,
)
trainer.train()

Training loop features:

  • ✅ Gradient accumulation: effective large batch training
  • ✅ Mixed precision (BF16): full RTX 4090 tensor core utilization
  • ✅ Cosine LR schedule with warmup: smooth convergence
  • ✅ Gradient clipping: stable training
  • ✅ W&B logging: real-time loss/LR tracking
  • ✅ Checkpoint saving: best model tracking by loss
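
As an illustration of the cosine schedule with warmup, the sketch below computes the learning rate for a given optimizer step. It uses the peak rate (2e-4) and warmup ratio (5%) from the hyperparameter table. The function name, the linear warmup shape, and the decay floor `min_lr=0.0` are assumptions, not details taken from `sft_trainer.py`.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_ratio=0.05, min_lr=0.0):
    """Learning rate at `step` (0-based) for linear warmup then cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 49, 525, 1000):
    print(step, f"{lr_at(step, total_steps=1000):.2e}")
```

With gradient accumulation, `step` would count optimizer updates rather than micro-batches, so the schedule advances once per 8 forward/backward passes.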

Fine-Tuning Hyperparameters

| Parameter | Value |
| --- | --- |
| Method | LoRA (PEFT) via Unsloth |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max Sequence Length | 2,048 tokens |
| Effective Batch Size | 16 (2 × grad_accum 8) |
| Learning Rate | 2e-4 |
| LR Scheduler | Cosine with warmup |
| Warmup Ratio | 5% |
| Epochs | 3 |
| Precision | BF16 mixed precision |
| Optimizer | AdamW 8-bit |

Post-Training

After fine-tuning, the LoRA adapter was merged back into the base model weights, resulting in a single, self-contained model with no external adapter dependency.
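
Merging is a per-matrix weight update. In the standard LoRA formulation (assumed here, since the merge script itself is not shown), each adapted weight $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ has low-rank factors $B \in \mathbb{R}^{d_{\text{out}} \times r}$ and $A \in \mathbb{R}^{r \times d_{\text{in}}}$, and the merge folds their scaled product into $W$:

```latex
W_{\text{merged}} = W + \frac{\alpha}{r}\, B A,
\qquad \frac{\alpha}{r} = \frac{32}{16} = 2
```

With rank 16 and alpha 32 from the table above, the adapter update is scaled by 2 before being added. The merged matrix keeps the shape of $W$, which is why no adapter files are needed at inference time.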


🤔 Thinking / Reasoning Support

AstraGPT-7B generates its reasoning inside <think>...</think> tags when triggered. This behavior was learned from the fine-tuning dataset, which used structured chain-of-thought formatting.

Example:

Input:

What is 15 * 47?

Output:

<think>
The multiplication involves multiplying 15 by 47.
  15 × 47 = 15 × 40 + 15 × 7
          = 600 + 105
          = 705
</think>
705

Trigger thinking mode:

# Append this to your prompt to force reasoning
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) + "<think>\n"
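
Downstream code usually needs the final answer separated from the reasoning. Here is a minimal sketch of that split, assuming the decoded text still contains the literal `</think>` tag. `split_reasoning` is a hypothetical helper, not part of the model's tokenizer or codebase.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Return (reasoning, answer) from a generated completion."""
    # The opening tag is absent when the prompt already ended with "<think>\n".
    if text.lstrip().startswith(open_tag):
        text = text.lstrip()[len(open_tag):]
    reasoning, sep, answer = text.partition(close_tag)
    if not sep:
        # Design choice: with no closing tag, treat everything as the answer.
        return "", text.strip()
    return reasoning.strip(), answer.strip()

sample = "<think>\n15 × 47 = 15 × 40 + 15 × 7 = 600 + 105 = 705\n</think>\n705"
reasoning, answer = split_reasoning(sample)
print(answer)   # 705
```

If the tags are registered as special tokens in the published tokenizer, decoding with `skip_special_tokens=True` may strip them. In that case, decode with `skip_special_tokens=False` before splitting.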

⚡ Quick Start

Install

pip install transformers torch bitsandbytes accelerate

Basic Inference

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "adityawakharkar/AstraGPT-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {
        "role": "system",
        "content": "You are AstraGPT, a helpful coding AI built by Tantra AI Labs. Think carefully using <think>...</think> tags before answering."
    },
    {
        "role": "user",
        "content": "Write a Python function to reverse a linked list."
    }
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) + "<think>\n"   # ← triggers reasoning

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.3,
        do_sample=True,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)

4-bit Quantized (Runs on ~6GB VRAM)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "adityawakharkar/AstraGPT-7B",
    quantization_config=bnb,
    device_map="auto"
)
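
As a rough check on the ~6GB figure, weight storage can be estimated from the parameter count and the bits per parameter. This sketch uses the nominal 7.0B size and ignores quantization metadata, any tensors left in 16-bit, activations, and the KV cache, all of which add to the real footprint.

```python
def weight_gib(n_params, bits_per_param):
    """Approximate weight storage in GiB, ignoring quantization metadata."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 7.0e9  # nominal size from the parameter table

print(f"fp16: {weight_gib(N_PARAMS, 16):.1f} GiB")  # 13.0 GiB
print(f"nf4:  {weight_gib(N_PARAMS, 4):.1f} GiB")   # 3.3 GiB
```

The gap between about 3.3 GiB of 4-bit weights and the quoted ~6GB leaves room for that overhead.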

📁 Codebase

The full from-scratch implementation is open-source:

AstraGPT-7B-scratch/
├── model/
│   ├── config.py              ← AstraGPTConfig (7B hyperparams, 1B/3B presets)
│   ├── rotary_embedding.py    ← RoPE from scratch (precompute + apply)
│   ├── attention.py           ← GQA from scratch (32Q / 8KV + KV cache)
│   ├── feedforward.py         ← SwiGLU + RMSNorm + TransformerBlock
│   └── transformer.py         ← Full model + generate() + save/load
├── tokenizer/
│   ├── bpe_tokenizer.py       ← Full BPE tokenizer (train, encode, decode)
│   └── train_tokenizer.py     ← Train on any text corpus
└── training/
    └── sft_trainer.py         ← Complete SFT loop (grad accum, bf16, cosine LR)

Bias, Risks, and Limitations

  • Hallucination: Can produce confident but incorrect answers, so always verify
  • Math limits: Complex multi-step math may fail; 7B is a small model
  • English-primary: Best performance in English
  • Reasoning trigger: <think> tags work most reliably with an explicit <think>\n prefix in the prompt

Environmental Impact

  • Hardware: 2× NVIDIA RTX 4090 (48 GB combined VRAM)
  • Infrastructure: Private bare-metal VPS
  • Training Duration: ~3–4 hours
  • Carbon Emitted: Estimated ~2–3 kgCO2eq

Citation

@misc{astragpt7b2026,
  author       = {Aditya Wakharkar},
  title        = {AstraGPT-7B: A 7B LLM Built From Scratch with Chain-of-Thought Reasoning},
  year         = {2026},
  publisher    = {HuggingFace},
  organization = {Tantra AI Labs},
  url          = {https://huggingface.co/adityawakharkar/AstraGPT-7B},
  note         = {Custom architecture, custom BPE tokenizer, trained on 2× RTX 4090}
}

Model Card Authors

Aditya Wakharkar (@adityawakharkar) | GitHub @codewith-aditya

Contact


Built from scratch with ❤️ by Tantra AI Labs
Every layer. Every weight. Every line of code.