base_model: adityawakharkar/AstraGPTCoder-7B
language:
- en
license: apache-2.0
tags:
- from-scratch
- custom-architecture
- custom-tokenizer
- reasoning
- chain-of-thought
- think-tags
- coding
- fine-tuned
- lora
- peft
- unsloth
- astragpt
- tantra-ai-labs
- rtx-4090
pipeline_tag: text-generation
library_name: transformers
model_creator: Tantra AI Labs
AstraGPT-7B
A 7-Billion Parameter Language Model, Built From Scratch
Custom Architecture · Custom BPE Tokenizer · Reasoning Fine-Tuned on Dual RTX 4090
Built by Aditya Wakharkar | Tantra AI Labs
What is AstraGPT-7B?
AstraGPT-7B is a 7-billion parameter decoder-only language model designed for coding and chain-of-thought reasoning.
Unlike most open-source fine-tunes, every core component of AstraGPT was designed and implemented from scratch in PyTorch, including the transformer architecture, the BPE tokenizer, and the supervised fine-tuning pipeline.
The model was then fine-tuned on a reasoning dataset using LoRA on a private VPS equipped with dual NVIDIA RTX 4090 GPUs, giving it native support for <think>...</think> style reasoning output.
"Most people fine-tune models. We built one."
Built From Scratch: Architecture Overview
Every layer of AstraGPT-7B was implemented from first principles in PyTorch. No AutoModel, no copy-paste: pure custom code.
Input Token IDs
      │
      ▼
Token Embedding [64,000 → 4,096]
      │
      ▼  ×32 Transformer Blocks
┌──────────────────────────────────────┐
│ AstraGPT Block                       │
│                                      │
│ RMSNorm (Pre-norm)                   │
│  → Grouped Query Attention (GQA)     │
│     · 32 Query Heads                 │
│     · 8 Key-Value Heads              │
│     · RoPE (θ = 1,000,000)           │
│     · KV Cache for inference         │
│  → Residual Add                      │
│                                      │
│ RMSNorm (Pre-norm)                   │
│  → SwiGLU Feed-Forward Network       │
│     · gate_proj, up_proj, down_proj  │
│     · intermediate_size = 11,008     │
│  → Residual Add                      │
└──────────────────────────────────────┘
      │
      ▼
Final RMSNorm
      │
      ▼
LM Head [4,096 → 64,000]
      │
      ▼
Logits → Next Token
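The two sub-layers in the block diagram rest on RMSNorm and a SwiGLU feed-forward with a pre-norm residual. The sketch below shows both on a single vector in plain Python with toy dimensions. It is an illustration, not the repository's PyTorch code: the function names, the `eps` value, and applying SiLU to the gate branch (the common LLaMA-style ordering) are assumptions.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root-mean-square, then scale. No mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, gate_proj, up_proj, down_proj):
    """down_proj(SiLU(gate_proj(x)) * up_proj(x)); the ordering is an assumption."""
    gate = matvec(gate_proj, x)
    up = matvec(up_proj, x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(down_proj, hidden)

def ffn_sublayer(x, norm_weight, gate_proj, up_proj, down_proj):
    """Pre-norm residual: x + FFN(RMSNorm(x))."""
    y = swiglu_ffn(rms_norm(x, norm_weight), gate_proj, up_proj, down_proj)
    return [a + b for a, b in zip(x, y)]
```

The attention sub-layer follows the same pre-norm pattern: normalize, transform, then add the result back to the unnormalized input.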
Architecture Highlights
| Component | Implementation | Why |
|---|---|---|
| Grouped Query Attention (GQA) | 32Q / 8KV heads, built from scratch | 4× less KV memory than MHA. The same scheme is used in LLaMA-3 and Mistral |
| Rotary Position Embeddings (RoPE) | Full RoPE math from scratch, θ=1M | Better long-context behavior than learned embeddings |
| SwiGLU FFN | gate × SiLU(up) through down_proj | Outperforms GELU/ReLU on LM benchmarks |
| RMSNorm | Pre-norm, no bias, no mean subtraction | ~30% faster than LayerNorm |
| Flash Attention | PyTorch 2.0 scaled_dot_product_attention | Memory-efficient attention with O(n) space |
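The "4× less KV memory" figure for GQA can be checked with simple arithmetic. The sketch below assumes 2-byte cache values (FP16/BF16), a 2,048-token sequence, and a head dimension of 4,096 / 32 = 128; the helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Two tensors (K and V) per layer, each shaped [seq_len, num_kv_heads, head_dim].
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value

HEAD_DIM = 4096 // 32  # hidden size / query heads = 128

mha = kv_cache_bytes(32, 32, HEAD_DIM, 2048)  # one KV head per query head
gqa = kv_cache_bytes(32, 8, HEAD_DIM, 2048)   # 8 shared KV heads

print(f"MHA: {mha / 2**20:.0f} MiB, GQA: {gqa / 2**20:.0f} MiB, ratio: {mha // gqa}x")
# MHA: 1024 MiB, GQA: 256 MiB, ratio: 4x
```

The ratio equals query heads divided by KV heads, so it holds at any sequence length.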
Parameter Count (~7B)
| Component | Parameters |
|---|---|
| Token Embedding (64K × 4096) | ~262M |
| Attention × 32 layers | ~2.15B |
| SwiGLU FFN × 32 layers | ~4.32B |
| RMSNorm × 65 | ~267K |
| LM Head | ~262M |
| Total | ~7.0B |
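The table can be reproduced from the dimensions in the architecture diagram. The sketch below assumes bias-free linear layers and an LM head that is not tied to the embedding; the variable names are illustrative. Under these assumptions the ~2.15B attention figure corresponds to key/value projections as wide as the query projection (32 KV heads). With the 8 KV heads shown in the diagram, attention comes to about 1.34B and the total to about 6.2B.

```python
VOCAB, HIDDEN, LAYERS = 64_000, 4_096, 32
Q_HEADS, KV_HEADS, INTERMEDIATE = 32, 8, 11_008
HEAD_DIM = HIDDEN // Q_HEADS  # 128

def attention_params(kv_heads):
    """Weights in q_proj, k_proj, v_proj and o_proj for one layer (no biases)."""
    q_and_o = 2 * HIDDEN * HIDDEN
    k_and_v = 2 * HIDDEN * kv_heads * HEAD_DIM
    return q_and_o + k_and_v

embedding = VOCAB * HIDDEN
lm_head = VOCAB * HIDDEN                    # assumes untied input/output embeddings
ffn = LAYERS * 3 * HIDDEN * INTERMEDIATE    # gate_proj, up_proj, down_proj
norms = (2 * LAYERS + 1) * HIDDEN           # two per block plus the final norm
attn_full = LAYERS * attention_params(Q_HEADS)   # K/V as wide as Q
attn_gqa = LAYERS * attention_params(KV_HEADS)   # 8 KV heads

shared = embedding + lm_head + ffn + norms
print(f"attention (32 KV heads): {attn_full / 1e9:.2f}B, total {(shared + attn_full) / 1e9:.2f}B")
print(f"attention (8 KV heads):  {attn_gqa / 1e9:.2f}B, total {(shared + attn_gqa) / 1e9:.2f}B")
```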
Custom BPE Tokenizer: From Scratch
AstraGPT uses a custom Byte Pair Encoding tokenizer built entirely from scratch: no SentencePiece, no HuggingFace tokenizers library.
# Built from scratch
from tokenizer import BPETokenizer
tok = BPETokenizer(vocab_size=64_000)
tok.train(open("corpus.txt"), num_merges=60_000)
Tokenizer features:
- Byte-level base vocabulary: 256 raw bytes, handles any Unicode
- GPT-4 style pre-tokenization regex: smart word boundary splitting
- 64,000 vocab size: 60K BPE merges on top of the byte base
- Built-in special tokens: <think>, </think>, <|im_start|>, <|im_end|>, BOS, EOS, PAD
- apply_chat_template(): custom chat format support
- Save/load: JSON-serializable merge rules
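The core of byte-level BPE is a loop that repeatedly merges the most frequent adjacent pair of token IDs. The sketch below shows that loop together with matching encode and decode functions. It is a minimal illustration rather than the repository's `BPETokenizer`: it skips the pre-tokenization regex and special tokens, and all function names are hypothetical.

```python
from collections import Counter

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of pair in ids with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to num_merges merge rules over the UTF-8 bytes of text."""
    ids = list(text.encode("utf-8"))      # base vocabulary: the 256 raw bytes
    merges = {}                           # (left_id, right_id) -> new token id
    for new_id in range(256, 256 + num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:                     # fewer than two tokens left
            break
        best = pairs.most_common(1)[0][0]
        merges[best] = new_id
        ids = merge_pair(ids, best, new_id)
    return merges

def encode(text, merges):
    """Apply merges in the order they were learned (dicts keep insertion order)."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids

def decode(ids, merges):
    """Expand each token id back to its byte string and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

merges = train_bpe("low lower lowest", num_merges=10)
print(decode(encode("lowest", merges), merges))  # round-trips to "lowest"
```

Because the base vocabulary covers every byte value, any UTF-8 string can be encoded without an unknown-token fallback.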
Training: Dual RTX 4090 on Private VPS
Fine-tuning was performed on a private Linux VPS with 2× NVIDIA RTX 4090 GPUs (48 GB VRAM in total).
Hardware Setup
| Spec | Value |
|---|---|
| GPUs | 2× NVIDIA RTX 4090 (24GB VRAM each) |
| Total VRAM | 48 GB |
| CPU | High-core count server CPU |
| Infrastructure | Private VPS (bare metal) |
| OS | Ubuntu 22.04 LTS |
| CUDA | 12.x |
Training Pipeline: Also Built From Scratch
The SFT (Supervised Fine-Tuning) training loop was implemented from scratch with production-grade features:
# Full custom training loop
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    # Dual GPU via DDP
    use_bf16=True,
    grad_accumulation=8,
    learning_rate=2e-4,
    use_wandb=True,
)
trainer.train()
Training loop features:
- Gradient accumulation: effective large batch training
- Mixed precision (BF16): full RTX 4090 tensor core utilization
- Cosine LR schedule with warmup: smooth convergence
- Gradient clipping: stable training
- W&B logging: real-time loss/LR tracking
- Checkpoint saving: best model tracking by loss
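A cosine schedule with linear warmup can be written as a pure function of the step index. The sketch below uses the peak learning rate (2e-4) and 5% warmup ratio stated in this card. The function name, the linear warmup shape, and decaying to `min_lr=0.0` are assumptions, not details confirmed by the training code.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.05, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 49, 50, 525, 1000):
    print(step, f"{lr_at_step(step, total_steps=1000):.2e}")
```

With gradient accumulation, `step` counts optimizer updates rather than micro-batches, so the schedule advances once every 8 forward/backward passes in the configuration shown here.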
Fine-Tuning Hyperparameters
| Parameter | Value |
|---|---|
| Method | LoRA (PEFT) via Unsloth |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max Sequence Length | 2,048 tokens |
| Effective Batch Size | 16 (2 × grad_accum 8) |
| Learning Rate | 2e-4 |
| LR Scheduler | Cosine with warmup |
| Warmup Ratio | 5% |
| Epochs | 3 |
| Precision | BF16 mixed precision |
| Optimizer | AdamW 8-bit |
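The rank and target modules above fix how many parameters LoRA actually trains. The estimate below assumes the projection shapes implied by the architecture section (hidden size 4,096, 8 KV heads of dimension 128, intermediate size 11,008) and the standard LoRA factorization, in which each adapted layer adds a rank × in_features matrix and an out_features × rank matrix.

```python
HIDDEN, INTERMEDIATE, LAYERS = 4_096, 11_008, 32
KV_DIM = 8 * (HIDDEN // 32)  # 8 KV heads x head_dim 128 = 1024
RANK, ALPHA = 16, 32

# (in_features, out_features) for each target module in one block.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, rank=RANK):
    """A is [rank, in_features] and B is [out_features, rank]."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o) for i, o in TARGETS.values())
total = per_layer * LAYERS
print(f"trainable LoRA parameters: {total:,}")         # 36,831,232
print(f"update scaling alpha / rank: {ALPHA / RANK}")  # 2.0
```

Roughly 37M trainable parameters is well under 1% of the base model, which is what keeps optimizer state small enough for 24 GB cards.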
Post-Training
After fine-tuning, the LoRA adapter was merged back into the base model weights, resulting in a single, self-contained model with no external adapter dependency.
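Merging folds the low-rank update into each adapted weight matrix as W' = W + (alpha / rank) · B · A. The toy example below does this with nested lists to show that the merged matrix gives the same output as running the base layer and the adapter side by side. The helper names are hypothetical, and a real merge would operate on tensors.

```python
def matmul(a, b):
    """Multiply two row-major matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # [out=2, in=3]
A = [[1.0, 2.0, 3.0]]                    # [rank=1, in=3]
B = [[0.5], [0.0]]                       # [out=2, rank=1]

merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
```

After the merge, inference needs only the single matrix, so there is no extra adapter matmul at runtime.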
π€ Thinking / Reasoning Support
AstraGPT-7B natively generates <think> tag reasoning when triggered. This was trained in via the fine-tuning dataset, which used structured chain-of-thought formatting.
Example:
Input:
What is 15 * 47?
Output:
<think>
The multiplication involves multiplying 15 by 47.
15 × 47 = 15 × 40 + 15 × 7
= 600 + 105
= 705
</think>
705
Trigger thinking mode:
# Append this to your prompt to force reasoning
prompt = tokenizer.apply_chat_template(messages, ...) + "<think>\n"
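Downstream code usually needs the reasoning and the final answer separately. The helper below is a small sketch (the name `split_reasoning` is hypothetical) that handles both cases: output containing the full `<think>...</think>` pair, and output where the opening tag was already supplied in the prompt. It assumes the tags survive decoding, which may require `skip_special_tokens=False` if they are registered as special tokens.

```python
def split_reasoning(generated):
    """Return (reasoning, answer) from generated text containing a </think> tag."""
    head, sep, tail = generated.partition("</think>")
    if not sep:
        # No closing tag: treat the whole output as the answer.
        return "", generated.strip()
    # Drop everything up to and including the last opening tag, if one is present.
    reasoning = head.rpartition("<think>")[2]
    return reasoning.strip(), tail.strip()

reasoning, answer = split_reasoning("15 × 47 = 600 + 105 = 705\n</think>\n705")
print(answer)  # 705
```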
Quick Start
Install
pip install transformers torch bitsandbytes accelerate
Basic Inference
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "adityawakharkar/AstraGPT-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto"
)

messages = [
    {
        "role": "system",
        "content": "You are AstraGPT, a helpful coding AI built by Tantra AI Labs. Think carefully using <think>...</think> tags before answering."
    },
    {
        "role": "user",
        "content": "Write a Python function to reverse a linked list."
    }
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) + "<think>\n"  # triggers reasoning

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.3,
        do_sample=True,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(response)
4-bit Quantized (Runs on ~6GB VRAM)
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization with double quantization; matmuls run in FP16
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "adityawakharkar/AstraGPT-7B",
    quantization_config=bnb,
    device_map="auto"
)
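A rough estimate of weight memory helps put the VRAM figure in context. The calculation below covers weights only and assumes a round 7e9 parameters. Quantization constants, activations, and the KV cache add to it, which is consistent with needing more headroom than the raw 4-bit size.

```python
def weight_gib(num_params, bits_per_param):
    """Memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

for label, bits in [("FP16", 16), ("NF4", 4)]:
    print(f"{label}: {weight_gib(7e9, bits):.1f} GiB")
# FP16: 13.0 GiB
# NF4: 3.3 GiB
```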
Codebase
The full from-scratch implementation is open-source:
AstraGPT-7B-scratch/
├── model/
│   ├── config.py              ← AstraGPTConfig (7B hyperparams, 1B/3B presets)
│   ├── rotary_embedding.py    ← RoPE from scratch (precompute + apply)
│   ├── attention.py           ← GQA from scratch (32Q / 8KV + KV cache)
│   ├── feedforward.py         ← SwiGLU + RMSNorm + TransformerBlock
│   └── transformer.py         ← Full model + generate() + save/load
├── tokenizer/
│   ├── bpe_tokenizer.py       ← Full BPE tokenizer (train, encode, decode)
│   └── train_tokenizer.py     ← Train on any text corpus
└── training/
    └── sft_trainer.py         ← Complete SFT loop (grad accum, bf16, cosine LR)
Bias, Risks, and Limitations
- Hallucination: Can produce confident but incorrect answers, so always verify
- Math limits: Complex multi-step math may fail, since 7B is a small model
- English-primary: Best performance in English
- Reasoning trigger: <think> tags work most reliably with an explicit <think>\n prefix in the prompt
Environmental Impact
- Hardware: 2× NVIDIA RTX 4090 (48GB combined VRAM)
- Infrastructure: Private bare-metal VPS
- Training Duration: ~3-4 hours
- Carbon Emitted: Estimated ~2-3 kgCO2eq
Citation
@misc{astragpt7b2026,
author = {Aditya Wakharkar},
title = {AstraGPT-7B: A 7B LLM Built From Scratch with Chain-of-Thought Reasoning},
year = {2026},
publisher = {HuggingFace},
organization = {Tantra AI Labs},
url = {https://huggingface.co/adityawakharkar/AstraGPT-7B},
note = {Custom architecture, custom BPE tokenizer, trained on 2× RTX 4090}
}
Model Card Authors
Aditya Wakharkar, @adityawakharkar | GitHub @codewith-aditya
Contact
- GitHub: github.com/codewith-aditya
- HuggingFace: @adityawakharkar
Every layer. Every weight. Every line of code.