---
base_model: adityawakharkar/AstraGPTCoder-7B
language:
- en
license: apache-2.0
tags:
- from-scratch
- custom-architecture
- custom-tokenizer
- reasoning
- chain-of-thought
- think-tags
- coding
- fine-tuned
- lora
- peft
- unsloth
- astragpt
- tantra-ai-labs
- rtx-4090
pipeline_tag: text-generation
library_name: transformers
model_creator: Tantra AI Labs
---
# AstraGPT-7B 🚀
<div align="center">
**A 7-Billion Parameter Language Model, Built From Scratch**
*Custom Architecture · Custom BPE Tokenizer · Reasoning Fine-Tuned on Dual RTX 4090*
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
[![Model](https://img.shields.io/badge/HuggingFace-AstraGPT--7B-yellow)](https://huggingface.co/adityawakharkar/AstraGPT-7B)
[![Params](https://img.shields.io/badge/Parameters-7B-blue)]()
[![GPU](https://img.shields.io/badge/Trained%20On-2×%20RTX%204090-76b900?logo=nvidia)](https://www.nvidia.com)
[![By](https://img.shields.io/badge/By-Tantra%20AI%20Labs-purple)](https://github.com/codewith-aditya)
Built by **Aditya Wakharkar** | [Tantra AI Labs](https://github.com/codewith-aditya)
</div>
---
## 🧠 What is AstraGPT-7B?
AstraGPT-7B is a **7-billion parameter decoder-only language model** designed for coding and chain-of-thought reasoning.
Unlike most open-source fine-tunes, **every core component of AstraGPT was designed and implemented from scratch in PyTorch**, including the transformer architecture, the BPE tokenizer, and the supervised fine-tuning pipeline.
The model was then **fine-tuned on a reasoning dataset** using LoRA on a **private VPS equipped with dual NVIDIA RTX 4090 GPUs**, giving it native support for `<think>...</think>` style reasoning output.
> *"Most people fine-tune models. We built one."*
---
## 🏗️ Built From Scratch: Architecture Overview
Every layer of AstraGPT-7B was implemented from first principles in PyTorch. No `AutoModel`, no copy-paste: pure custom code.
```
Input Token IDs
      │
      ▼
Token Embedding  [64,000 → 4,096]
      │
      ▼  ×32 Transformer Blocks
┌────────────────────────────────────────┐
│  AstraGPT Block                        │
│                                        │
│  RMSNorm (Pre-norm)                    │
│    → Grouped Query Attention (GQA)     │
│      · 32 Query Heads                  │
│      · 8 Key-Value Heads               │
│      · RoPE (θ = 1,000,000)            │
│      · KV Cache for inference          │
│    → Residual Add                      │
│                                        │
│  RMSNorm (Pre-norm)                    │
│    → SwiGLU Feed-Forward Network       │
│      · gate_proj, up_proj, down_proj   │
│      · intermediate_size = 11,008      │
│    → Residual Add                      │
└────────────────────────────────────────┘
      │
      ▼
Final RMSNorm
      │
      ▼
LM Head  [4,096 → 64,000]
      │
      ▼
Logits → Next Token
```
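To make the 32-query / 8-key-value layout in the diagram concrete, here is a minimal sketch of how query heads map onto shared KV heads and how large the KV cache gets. It is an illustration only: `head_dim = 128` is an assumption (4,096 / 32), the 2-byte value size assumes FP16 or BF16 storage, and both function names are hypothetical rather than taken from the AstraGPT codebase.

```python
def kv_head_for_query_head(q_head, n_q_heads=32, n_kv_heads=8):
    """Return the index of the KV head that a given query head reads from.

    Consecutive groups of n_q_heads // n_kv_heads query heads share one KV head.
    """
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 accounts for storing both K and V.
    head_dim=128 is an assumption (hidden size 4096 / 32 query heads).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


gqa = kv_cache_bytes(2048)                 # 8 KV heads, as in the diagram
mha = kv_cache_bytes(2048, n_kv_heads=32)  # hypothetical full multi-head baseline
print(f"GQA: {gqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB, ratio: {mha / gqa:.0f}x")
```

Under these assumptions a 2,048-token sequence needs 256 MiB of cache with 8 KV heads versus 1,024 MiB with 32.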
### Architecture Highlights
| Component | Implementation | Why |
|-----------|---------------|-----|
| **Grouped Query Attention (GQA)** | 32 query / 8 KV heads, built from scratch | 4× smaller KV cache than MHA; the same scheme is used in Llama 3 and Mistral |
| **Rotary Position Embeddings (RoPE)** | Full RoPE math from scratch, θ = 1M | Better long-context behavior than learned absolute position embeddings |
| **SwiGLU FFN** | `down_proj(SiLU(gate_proj(x)) × up_proj(x))` | Reported to outperform GELU/ReLU feed-forward layers on language-modeling benchmarks |
| **RMSNorm** | Pre-norm, no bias, no mean subtraction | Cheaper to compute than LayerNorm |
| **Flash Attention** | PyTorch 2.x `scaled_dot_product_attention`, which dispatches to a fused FlashAttention kernel when available | Memory-efficient attention: O(n) extra memory in sequence length instead of O(n²) |
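The key property of RoPE is that the attention score between a rotated query and a rotated key depends only on their relative distance. The sketch below shows this in plain Python. It is not the project's `rotary_embedding.py`: the function name is hypothetical, and it pairs adjacent dimensions `(2i, 2i+1)`, whereas some implementations pair dimension `i` with `i + d/2`.

```python
import math


def rope_rotate(vec, position, theta=1_000_000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by angle position * theta**(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):          # i takes the values 0, 2, 4, ... (i.e. 2 * pair index)
        freq = theta ** (-i / d)
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same relative offset (2 positions) at two different absolute positions.
near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
far = dot(rope_rotate(q, 105), rope_rotate(k, 103))
print(abs(near - far) < 1e-9)
```

Because each pair is only rotated, vector norms are preserved, and shifting both positions by the same amount leaves the dot product unchanged up to floating-point error.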
### Parameter Count (~7B)
| Component | Parameters |
|-----------|-----------|
| Token Embedding (64K × 4096) | ~262M |
| Attention × 32 layers | ~2.15B |
| SwiGLU FFN × 32 layers | ~4.32B |
| RMSNorm × 65 | ~267K |
| LM Head | ~262M |
| **Total** | **~7.0B** |
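The table can be reproduced with a few multiplications. The sketch below assumes untied input and output embeddings, bias-free projections, and `head_dim = 4096 / 32 = 128`, none of which this card states explicitly; the function name is hypothetical. Under those assumptions, full-width K/V projections (32 KV heads) give the ~2.15B attention figure and the ~7.0B total shown above, while projecting K and V to only 8 heads of width 128 gives about 1.34B for attention and about 6.2B overall.

```python
def count_params(vocab=64_000, d_model=4096, n_layers=32, n_q_heads=32,
                 n_kv_heads=8, d_ff=11_008):
    """Count weights per component (assumes no biases and untied embeddings)."""
    head_dim = d_model // n_q_heads          # assumed: 128
    kv_dim = n_kv_heads * head_dim           # width of the K and V projections
    counts = {
        "embedding": vocab * d_model,
        # q_proj and o_proj are d_model x d_model; k_proj and v_proj are d_model x kv_dim
        "attention": n_layers * (2 * d_model * d_model + 2 * d_model * kv_dim),
        # gate_proj, up_proj, down_proj
        "ffn": n_layers * 3 * d_model * d_ff,
        # two RMSNorms per block plus the final one
        "norms": (2 * n_layers + 1) * d_model,
        "lm_head": vocab * d_model,
    }
    counts["total"] = sum(counts.values())
    return counts


for label, kv_heads in [("8 KV heads (GQA)", 8), ("32 KV heads (MHA)", 32)]:
    c = count_params(n_kv_heads=kv_heads)
    print(f"{label}: attention={c['attention'] / 1e9:.2f}B total={c['total'] / 1e9:.2f}B")
```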
---
## 🔤 Custom BPE Tokenizer, From Scratch
AstraGPT uses a **custom Byte Pair Encoding tokenizer** built entirely from scratch: no SentencePiece, no HuggingFace tokenizers library.
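```python
# Built from scratch (BPETokenizer is the project's own class)
from tokenizer import BPETokenizer

tok = BPETokenizer(vocab_size=64_000)

# Use a context manager so the corpus file handle is closed after training
with open("corpus.txt", encoding="utf-8") as corpus:
    tok.train(corpus, num_merges=60_000)
```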
**Tokenizer features:**
- **Byte-level base vocabulary:** 256 raw bytes, handles any Unicode
- **GPT-4 style pre-tokenization regex:** smart word boundary splitting
- **64,000 vocab size:** 60K BPE merges on top of byte base
- **Built-in special tokens:** `<think>`, `</think>`, `<|im_start|>`, `<|im_end|>`, BOS, EOS, PAD
- **`apply_chat_template()`:** custom chat format support
- **Save/load:** JSON-serializable merge rules
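For readers unfamiliar with byte-level BPE, the core train/encode/decode cycle can be sketched in a few lines. This is a simplified illustration, not the project's `BPETokenizer`: it skips the pre-tokenization regex and special tokens, and all function names are hypothetical.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges on top of the 256-entry byte vocabulary."""
    ids = list(text.encode("utf-8"))
    merges = {}                      # (left_id, right_id) -> new_id, in merge order
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:                # nothing left worth merging
            break
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges


def encode(text, merges):
    """Apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand merged ids back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


merges = train_bpe("low lower lowest low low", num_merges=10)
ids = encode("lowest low", merges)
print(len(ids), decode(ids, merges))
```

Because the base vocabulary is raw bytes, text never seen during training still round-trips; it simply compresses less.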
---
## ⚡ Training: Dual RTX 4090 on Private VPS
Fine-tuning was performed on a **private Linux VPS with 2× NVIDIA RTX 4090 GPUs** (48 GB total VRAM).
### Hardware Setup
| Spec | Value |
|------|-------|
| GPUs | **2× NVIDIA RTX 4090** (24GB VRAM each) |
| Total VRAM | **48 GB** |
| CPU | High-core count server CPU |
| Infrastructure | Private VPS (bare metal) |
| OS | Ubuntu 22.04 LTS |
| CUDA | 12.x |
### Training Pipeline: Also Built From Scratch
The SFT (Supervised Fine-Tuning) training loop was implemented from scratch with production-grade features:
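```python
# Full custom training loop (SFTTrainer is the project's own class).
# The import path is assumed from the repository layout; `model`,
# `tokenizer`, and `dataset` must be created beforehand.
from training.sft_trainer import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    # Dual GPU via DDP
    use_bf16=True,
    grad_accumulation=8,
    learning_rate=2e-4,
    use_wandb=True,
)
trainer.train()
```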
**Training loop features:**
- ✅ **Gradient accumulation:** effective large-batch training
- ✅ **Mixed precision (BF16):** uses the RTX 4090 tensor cores
- ✅ **Cosine LR schedule with warmup:** smooth convergence
- ✅ **Gradient clipping:** stable training
- ✅ **W&B logging:** real-time loss/LR tracking
- ✅ **Checkpoint saving:** best model tracking by loss
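As an illustration of the cosine schedule with warmup listed above, here is a minimal sketch using the peak learning rate (2e-4) and warmup ratio (5%) from this card. The function name, the linear warmup shape, and the decay to zero are assumptions; the project's trainer may differ in these details.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.05, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 1000
for step in (0, 49, 525, 1000):
    print(step, f"{lr_at_step(step, total):.2e}")
```

With 1,000 total steps, the rate climbs for the first 50 steps, sits at half of the peak midway through the decay, and reaches the minimum at the end.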
### Fine-Tuning Hyperparameters
| Parameter | Value |
|-----------|-------|
| Method | LoRA (PEFT) via Unsloth |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max Sequence Length | 2,048 tokens |
| Effective Batch Size | 16 (2 × grad_accum 8) |
| Learning Rate | 2e-4 |
| LR Scheduler | Cosine with warmup |
| Warmup Ratio | 5% |
| Epochs | 3 |
| Precision | BF16 mixed precision |
| Optimizer | AdamW 8-bit |
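To give a sense of scale for these settings, the number of trainable LoRA weights can be estimated from the rank and the shapes of the target modules. This is a back-of-the-envelope sketch: the projection shapes assume `head_dim = 128` and the layer sizes from the architecture section, and the function name is hypothetical.

```python
def lora_trainable_params(rank=16, d_model=4096, n_layers=32, n_q_heads=32,
                          n_kv_heads=8, d_ff=11_008):
    """Each adapted Linear(in, out) adds A (rank x in) and B (out x rank)."""
    head_dim = d_model // n_q_heads       # assumed: 128
    kv_dim = n_kv_heads * head_dim
    shapes = {                            # module -> (in_features, out_features)
        "q_proj": (d_model, d_model),
        "k_proj": (d_model, kv_dim),
        "v_proj": (d_model, kv_dim),
        "o_proj": (d_model, d_model),
        "gate_proj": (d_model, d_ff),
        "up_proj": (d_model, d_ff),
        "down_proj": (d_ff, d_model),
    }
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return n_layers * per_layer


trainable = lora_trainable_params()
scaling = 32 / 16                         # LoRA alpha / rank from the table
print(f"{trainable / 1e6:.1f}M trainable parameters, update scaling = {scaling}")
```

Under these assumptions the adapter holds roughly 36.8M weights, well under 1% of the base model.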
### Post-Training
After fine-tuning, the LoRA adapter was **merged back into base model weights** β€” resulting in a single, self-contained model with no external adapter dependency.
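Merging a LoRA adapter amounts to adding the scaled low-rank product to each adapted weight matrix, W' = W + (alpha / r) · B · A. The toy example below shows this with plain Python lists. It is a sketch of the general technique with hypothetical helper names, not the code used for this model (in practice a library call such as PEFT's `merge_and_unload()` typically handles this).

```python
def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A).

    weight: out x in, lora_a: rank x in, lora_b: out x rank.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]          # rank 1 x in 2
B = [[1.0], [0.0]]        # out 2 x rank 1
merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

After merging, a forward pass through `merged` gives the same result as running the base weight and the adapter side by side, so the adapter files are no longer needed at inference time.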
---
## 🤔 Thinking / Reasoning Support
AstraGPT-7B generates `<think>`-tagged reasoning when triggered. This behavior was learned from the fine-tuning dataset, which used structured chain-of-thought formatting.
**Example:**
**Input:**
```
What is 15 * 47?
```
**Output:**
```
<think>
The multiplication involves multiplying 15 by 47.
15 × 47 = 15 × 40 + 15 × 7
= 600 + 105
= 705
</think>
705
```
**Trigger thinking mode:**
```python
# Append "<think>\n" to the rendered prompt to force reasoning
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) + "<think>\n"
```
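When the prompt already ends with `<think>\n`, the generated text starts inside the reasoning block and only the closing tag appears in the output. A small helper like the one below can separate the reasoning from the final answer. The function name is hypothetical. It also assumes the tags survive decoding: if `<think>` and `</think>` are registered as special tokens, decoding with `skip_special_tokens=True` may strip them, so decode with `skip_special_tokens=False` before parsing.

```python
def split_reasoning(generated):
    """Split generated text into (reasoning, answer).

    Handles output with or without a leading <think> tag. If no closing
    tag is found, the whole text is returned as the answer.
    """
    text = generated.lstrip()
    if text.startswith("<think>"):
        text = text[len("<think>"):]
    reasoning, closing_tag, answer = text.partition("</think>")
    if not closing_tag:
        return "", generated.strip()
    return reasoning.strip(), answer.strip()


reasoning, answer = split_reasoning("15 × 47 = 600 + 105 = 705\n</think>\n705")
print(repr(reasoning), repr(answer))
```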
---
## ⚡ Quick Start
### Install
```bash
pip install transformers torch bitsandbytes accelerate
```
### Basic Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "adityawakharkar/AstraGPT-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {
        "role": "system",
        "content": "You are AstraGPT, a helpful coding AI built by Tantra AI Labs. Think carefully using <think>...</think> tags before answering."
    },
    {
        "role": "user",
        "content": "Write a Python function to reverse a linked list."
    }
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) + "<think>\n"  # triggers reasoning

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.3,
        do_sample=True,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens (skip the prompt)
response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)
```
### 4-bit Quantized (Runs on ~6GB VRAM)
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights with double quantization; matmuls run in FP16
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "adityawakharkar/AstraGPT-7B",
    quantization_config=bnb,
    device_map="auto"
)
```
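A rough estimate shows where the ~6GB figure comes from. The sketch below counts weight storage only, for a nominal 7B parameters; the function name is hypothetical. The remaining headroom goes to things this estimate ignores, such as quantization constants, activations, the KV cache, and the CUDA context.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory for the weights alone, in GiB (ignores activations and KV cache)."""
    return n_params * bits_per_param / 8 / 2**30


n_params = 7e9  # nominal parameter count
print(f"FP16 : {weight_memory_gib(n_params, 16):.2f} GiB")
print(f"4-bit: {weight_memory_gib(n_params, 4):.2f} GiB")
```

The 4-bit weights alone come to a little over 3 GiB, compared with about 13 GiB in FP16.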
---
## 📁 Codebase
The full from-scratch implementation is open-source:
```
AstraGPT-7B-scratch/
├── model/
│   ├── config.py            ← AstraGPTConfig (7B hyperparams, 1B/3B presets)
│   ├── rotary_embedding.py  ← RoPE from scratch (precompute + apply)
│   ├── attention.py         ← GQA from scratch (32Q / 8KV + KV cache)
│   ├── feedforward.py       ← SwiGLU + RMSNorm + TransformerBlock
│   └── transformer.py       ← Full model + generate() + save/load
├── tokenizer/
│   ├── bpe_tokenizer.py     ← Full BPE tokenizer (train, encode, decode)
│   └── train_tokenizer.py   ← Train on any text corpus
└── training/
    └── sft_trainer.py       ← Complete SFT loop (grad accum, bf16, cosine LR)
```
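To illustrate the kind of logic a file like `feedforward.py` contains, here is a dependency-free sketch of RMSNorm, a SwiGLU feed-forward layer, and the pre-norm residual connection around it. It is not the repository's code: the function names are hypothetical, plain lists stand in for tensors, and the gating order (SiLU applied to the `gate_proj` branch) follows the common Llama-style convention, which is an assumption here.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by 1 / sqrt(mean(x^2) + eps); no mean subtraction, no bias."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def swiglu_ffn(x, gate_w, up_w, down_w):
    """down_proj(SiLU(gate_proj(x)) * up_proj(x))."""
    gate = [silu(g) for g in matvec(gate_w, x)]
    up = matvec(up_w, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(down_w, hidden)


def ffn_sublayer(x, norm_weight, gate_w, up_w, down_w):
    """Pre-norm residual: x + FFN(RMSNorm(x))."""
    out = swiglu_ffn(rms_norm(x, norm_weight), gate_w, up_w, down_w)
    return [a + b for a, b in zip(x, out)]


x = [3.0, 4.0]
gate_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 2 -> 3
up_w = [[0.5, 0.5], [1.0, -1.0], [0.0, 1.0]]    # 2 -> 3
down_w = [[0.1, 0.2, 0.3], [0.0, 0.1, 0.0]]     # 3 -> 2
print(ffn_sublayer(x, [1.0, 1.0], gate_w, up_w, down_w))
```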
---
## Bias, Risks, and Limitations
- **Hallucination:** Can produce confident but incorrect answers; always verify
- **Math limits:** Complex multi-step math may fail, since 7B is a small model
- **English-primary:** Best performance in English
- **Reasoning trigger:** `<think>` tags work most reliably with an explicit `<think>\n` prefix in the prompt
---
## Environmental Impact
- **Hardware:** 2× NVIDIA RTX 4090 (48GB combined VRAM)
- **Infrastructure:** Private bare-metal VPS
- **Training Duration:** ~3–4 hours
- **Carbon Emitted:** Estimated ~2–3 kgCO2eq
---
## Citation
```bibtex
@misc{astragpt7b2026,
  author       = {Aditya Wakharkar},
  title        = {AstraGPT-7B: A 7B LLM Built From Scratch with Chain-of-Thought Reasoning},
  year         = {2026},
  publisher    = {HuggingFace},
  organization = {Tantra AI Labs},
  url          = {https://huggingface.co/adityawakharkar/AstraGPT-7B},
  note         = {Custom architecture, custom BPE tokenizer, trained on 2× RTX 4090}
}
```
---
## Model Card Authors
**Aditya Wakharkar** | [@adityawakharkar](https://huggingface.co/adityawakharkar) | [GitHub @codewith-aditya](https://github.com/codewith-aditya)
## Contact
- 🐙 GitHub: [github.com/codewith-aditya](https://github.com/codewith-aditya)
- 🤗 HuggingFace: [@adityawakharkar](https://huggingface.co/adityawakharkar)
---
<div align="center">
<em>Built from scratch with ❤️ by <strong>Tantra AI Labs</strong></em><br/>
<em>Every layer. Every weight. Every line of code.</em>
</div>