Instructions for using adityawakharkar/AstraGPTCoder-7B with libraries and local apps.

Libraries

How to use adityawakharkar/AstraGPTCoder-7B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="adityawakharkar/AstraGPTCoder-7B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("adityawakharkar/AstraGPTCoder-7B")
model = AutoModelForCausalLM.from_pretrained("adityawakharkar/AstraGPTCoder-7B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use adityawakharkar/AstraGPTCoder-7B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "adityawakharkar/AstraGPTCoder-7B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "adityawakharkar/AstraGPTCoder-7B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/adityawakharkar/AstraGPTCoder-7B

SGLang

How to use adityawakharkar/AstraGPTCoder-7B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "adityawakharkar/AstraGPTCoder-7B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "adityawakharkar/AstraGPTCoder-7B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "adityawakharkar/AstraGPTCoder-7B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "adityawakharkar/AstraGPTCoder-7B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use adityawakharkar/AstraGPTCoder-7B with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for adityawakharkar/AstraGPTCoder-7B to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for adityawakharkar/AstraGPTCoder-7B to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for adityawakharkar/AstraGPTCoder-7B to start chatting

Load model with FastModel

# Install first (in a shell): pip install unsloth
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="adityawakharkar/AstraGPTCoder-7B",
    max_seq_length=2048,
)

Docker Model Runner
How to use adityawakharkar/AstraGPTCoder-7B with Docker Model Runner:
```
docker model run hf.co/adityawakharkar/AstraGPTCoder-7B
```

	---
	base_model: adityawakharkar/AstraGPTCoder-7B
	language:
	- en
	license: apache-2.0
	tags:
	- from-scratch
	- custom-architecture
	- custom-tokenizer
	- reasoning
	- chain-of-thought
	- think-tags
	- coding
	- fine-tuned
	- lora
	- peft
	- unsloth
	- astragpt
	- tantra-ai-labs
	- rtx-4090
	pipeline_tag: text-generation
	library_name: transformers
	model_creator: Tantra AI Labs
	---

	# AstraGPT-7B 🚀

	<div align="center">

	A 7-Billion Parameter Language Model — Built From Scratch

	Custom Architecture · Custom BPE Tokenizer · Reasoning Fine-Tuned on Dual RTX 4090

	[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
	[![Model](https://img.shields.io/badge/HuggingFace-AstraGPT--7B-yellow)](https://huggingface.co/adityawakharkar/AstraGPT-7B)
	[![Params](https://img.shields.io/badge/Parameters-7B-blue)]()
	[![GPU](https://img.shields.io/badge/Trained%20On-2×%20RTX%204090-76b900?logo=nvidia)](https://www.nvidia.com)
	[![By](https://img.shields.io/badge/By-Tantra%20AI%20Labs-purple)](https://github.com/codewith-aditya)

	Built by Aditya Wakharkar | [Tantra AI Labs](https://github.com/codewith-aditya)

	</div>

	---

	## 🧠 What is AstraGPT-7B?

	AstraGPT-7B is a 7-billion parameter decoder-only language model designed for coding and chain-of-thought reasoning.

	Unlike most open-source fine-tunes, every core component of AstraGPT was designed and implemented from scratch in PyTorch — including the transformer architecture, the BPE tokenizer, and the supervised fine-tuning pipeline.

	The model was then fine-tuned on a reasoning dataset using LoRA on a private VPS equipped with dual NVIDIA RTX 4090 GPUs, giving it native support for `<think>...</think>` style reasoning output.

	> "Most people fine-tune models. We built one."

	---

	## 🏗️ Built From Scratch — Architecture Overview

	Every layer of AstraGPT-7B was implemented from first principles in PyTorch. No `AutoModel`, no copy-paste — pure custom code.

	```
	Input Token IDs
	│
	▼
	Token Embedding [64,000 → 4,096]
	│
	▼ ×32 Transformer Blocks
	┌─────────────────────────────────────┐
	│  AstraGPT Block                     │
	│                                     │
	│  RMSNorm (Pre-norm)                 │
	│  → Grouped Query Attention (GQA)    │
	│    · 32 Query Heads                 │
	│    · 8 Key-Value Heads              │
	│    · RoPE (θ = 1,000,000)           │
	│    · KV Cache for inference         │
	│  → Residual Add                     │
	│                                     │
	│  RMSNorm (Pre-norm)                 │
	│  → SwiGLU Feed-Forward Network      │
	│    · gate_proj, up_proj, down_proj  │
	│    · intermediate_size = 11,008     │
	│  → Residual Add                     │
	└─────────────────────────────────────┘
	│
	▼
	Final RMSNorm
	│
	▼
	LM Head [4,096 → 64,000]
	│
	▼
	Logits → Next Token
	```
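
To make the block diagram concrete, here is a minimal pure-Python sketch of the two non-attention pieces, RMSNorm and a SwiGLU feed-forward step, operating on plain lists. The function names (`rms_norm`, `swiglu_ffn`, `block_ffn_half`), the `eps` value, and the toy weight layout are illustrative assumptions and are not taken from the AstraGPT codebase. The SwiGLU form shown follows the common `down(SiLU(gate(x)) * up(x))` convention.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square (no mean subtraction, no bias)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def swiglu_ffn(x, gate_w, up_w, down_w):
    """SwiGLU feed-forward: down(SiLU(gate(x)) * up(x))."""
    gate = matvec(gate_w, x)
    up = matvec(up_w, x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(down_w, hidden)


def block_ffn_half(x, norm_w, gate_w, up_w, down_w):
    """Pre-norm residual sub-layer: x + FFN(RMSNorm(x))."""
    ffn_out = swiglu_ffn(rms_norm(x, norm_w), gate_w, up_w, down_w)
    return [a + b for a, b in zip(x, ffn_out)]
```

The attention half of the block has the same pre-norm plus residual shape, with the feed-forward call swapped for attention.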

	### Architecture Highlights

	| Component | Implementation | Why |
	|-----------|----------------|-----|
	| Grouped Query Attention (GQA) | 32Q / 8KV heads, built from scratch | 4× less KV-cache memory than MHA; the same scheme is used in LLaMA-3 and Mistral |
	| Rotary Position Embeddings (RoPE) | Full RoPE math from scratch, θ=1M | Better long-context behavior than learned position embeddings |
	| SwiGLU FFN | SiLU(gate) × up, then through down_proj | Outperforms GELU/ReLU on LM benchmarks |
	| RMSNorm | Pre-norm, no bias, no mean subtraction | ~30% faster than LayerNorm |
	| Flash Attention | PyTorch 2.0 `scaled_dot_product_attention` | Memory-efficient attention with O(n) space |
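
The "4× less KV memory" figure for GQA can be checked with a few lines of arithmetic. The sketch below assumes a head dimension of 128 (hidden size 4,096 split evenly across 32 query heads), 2 bytes per cached value (FP16/BF16), and a 2,048-token sequence. The helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading factor 2 accounts for storing both K and V in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


HIDDEN_SIZE = 4096
NUM_QUERY_HEADS = 32
HEAD_DIM = HIDDEN_SIZE // NUM_QUERY_HEADS  # 128 (assumed even split across query heads)

gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=HEAD_DIM, seq_len=2048)
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=HEAD_DIM, seq_len=2048)

print(f"GQA: {gqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB, ratio: {mha / gqa:.0f}x")
# GQA: 256 MiB, MHA: 1024 MiB, ratio: 4x
```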

	### Parameter Count (~7B)

	| Component | Parameters |
	|-----------|-----------|
	| Token Embedding (64K × 4096) | ~262M |
	| Attention × 32 layers | ~2.15B |
	| SwiGLU FFN × 32 layers | ~4.32B |
	| RMSNorm × 65 | ~267K |
	| LM Head | ~262M |
	| Total | ~7.0B |
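
The rows above can be reproduced from the stated hyperparameters. The sketch below is an illustration under assumptions: a head dimension of 128, no bias terms, and an LM head that is not tied to the embedding. `count_params` is a hypothetical helper. The table's attention row (~2.15B) corresponds to key and value projections as wide as the query projection. Under the 8-KV-head layout described earlier, the same arithmetic gives roughly 1.34B for attention and about 6.2B in total, so treat the figures as approximate.

```python
VOCAB = 64_000
HIDDEN = 4_096
LAYERS = 32
Q_HEADS = 32
INTERMEDIATE = 11_008
HEAD_DIM = HIDDEN // Q_HEADS  # 128 (assumption)


def count_params(kv_heads):
    """Parameter count per component for a given number of key-value heads."""
    kv_dim = kv_heads * HEAD_DIM
    counts = {
        "embedding": VOCAB * HIDDEN,
        # q_proj and o_proj are HIDDEN x HIDDEN; k_proj and v_proj are HIDDEN x kv_dim.
        "attention": LAYERS * (2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim),
        # gate_proj, up_proj and down_proj each hold HIDDEN * INTERMEDIATE weights.
        "ffn": LAYERS * 3 * HIDDEN * INTERMEDIATE,
        # Two norms per block plus one final norm, HIDDEN weights each.
        "norms": (2 * LAYERS + 1) * HIDDEN,
        "lm_head": VOCAB * HIDDEN,
    }
    counts["total"] = sum(counts.values())
    return counts


for kv_heads in (32, 8):
    c = count_params(kv_heads)
    print(f"{kv_heads:>2} KV heads: attention={c['attention'] / 1e9:.2f}B total={c['total'] / 1e9:.2f}B")
# 32 KV heads: attention=2.15B total=7.00B
#  8 KV heads: attention=1.34B total=6.20B
```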

	---

	## 🔤 Custom BPE Tokenizer — From Scratch

	AstraGPT uses a custom Byte Pair Encoding tokenizer built entirely from scratch — no SentencePiece, no HuggingFace tokenizers library.

	```python
	# Built from scratch
	from tokenizer import BPETokenizer

	tok = BPETokenizer(vocab_size=64_000)
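	# Open the corpus in a context manager so the file handle is closed after training
	with open("corpus.txt", encoding="utf-8") as corpus:
	    tok.train(corpus, num_merges=60_000)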
	```

	Tokenizer features:
	- Byte-level base vocabulary — 256 raw bytes, handles any Unicode
	- GPT-4 style pre-tokenization regex — smart word boundary splitting
	- 64,000 vocab size — 60K BPE merges on top of byte base
	- Built-in special tokens: `<think>`, `</think>`, `<|im_start|>`, `<|im_end|>`, BOS, EOS, PAD
	- `apply_chat_template()` — custom chat format support
	- Save/load — JSON-serializable merge rules
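
For readers new to byte-level BPE, the core training loop can be sketched in a few lines. This is a simplified illustration, not the AstraGPT tokenizer itself: it skips the pre-tokenization regex and special tokens, and the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are hypothetical.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token IDs."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules on top of the 256-byte base vocabulary."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        if len(ids) < 2:
            break
        pair = most_frequent_pair(ids)
        new_id = 256 + step  # new tokens are numbered after the 256 raw bytes
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids)     # [258, 100, 258, 97, 99]
print(merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```

Because the merge rules are just a mapping from ID pairs to new IDs, they serialize naturally, which fits the JSON save/load feature listed above.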

	---

	## ⚡ Training — Dual RTX 4090 on Private VPS

	Fine-tuning was performed on a private Linux VPS with 2× NVIDIA RTX 4090 GPUs (total 48GB VRAM).

	### Hardware Setup

	| Spec | Value |
	|------|-------|
	| GPUs | 2× NVIDIA RTX 4090 (24GB VRAM each) |
	| Total VRAM | 48 GB |
	| CPU | High-core count server CPU |
	| Infrastructure | Private VPS (bare metal) |
	| OS | Ubuntu 22.04 LTS |
	| CUDA | 12.x |

	### Training Pipeline — Also Built From Scratch

	The SFT (Supervised Fine-Tuning) training loop was implemented from scratch with production-grade features:

	```python
	# Full custom training loop
	trainer = SFTTrainer(
	    model=model,
	    tokenizer=tokenizer,
	    dataset=dataset,
	    # Dual GPU via DDP
	    use_bf16=True,
	    grad_accumulation=8,
	    learning_rate=2e-4,
	    use_wandb=True,
	)
	trainer.train()
	```

	Training loop features:
	- ✅ Gradient accumulation — effective large batch training
	- ✅ Mixed precision (BF16) — full RTX 4090 tensor core utilization
	- ✅ Cosine LR schedule with warmup — smooth convergence
	- ✅ Gradient clipping — stable training
	- ✅ W&B logging — real-time loss/LR tracking
	- ✅ Checkpoint saving — best model tracking by loss
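
As an illustration of the cosine schedule with warmup, the function below computes the learning rate for a given step. The peak rate (2e-4) and warmup ratio (5%) are the values reported in this card. The linear warmup shape, the `min_lr` floor of 0, and the name `lr_at_step` are assumptions made for the sketch.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.05, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 49, 525, 1000):
    print(step, f"{lr_at_step(step, total_steps=1000):.2e}")
```

With 1,000 total steps, the rate climbs over the first 50 steps, sits at the peak at step 49, passes half the peak near step 525, and reaches the floor at the end.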

	### Fine-Tuning Hyperparameters

	| Parameter | Value |
	|-----------|-------|
	| Method | LoRA (PEFT) via Unsloth |
	| LoRA Rank | 16 |
	| LoRA Alpha | 32 |
	| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
	| Max Sequence Length | 2,048 tokens |
	| Effective Batch Size | 16 (2 × grad_accum 8) |
	| Learning Rate | 2e-4 |
	| LR Scheduler | Cosine with warmup |
	| Warmup Ratio | 5% |
	| Epochs | 3 |
	| Precision | BF16 mixed precision |
	| Optimizer | AdamW 8-bit |
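
To give a sense of scale, the number of trainable LoRA parameters implied by these settings can be estimated from the projection shapes. The sketch assumes a head dimension of 128 (so the key and value projections are 4,096 → 1,024) and adapters on all 32 layers. The names below are illustrative only.

```python
HIDDEN = 4_096
KV_DIM = 8 * (HIDDEN // 32)  # 8 KV heads x assumed head_dim of 128 = 1,024
INTERMEDIATE = 11_008
LAYERS = 32
RANK = 16
ALPHA = 32

# (in_features, out_features) for each targeted projection.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params_per_layer(rank):
    """Each adapter adds A (rank x in_features) and B (out_features x rank)."""
    return sum(rank * (fan_in + fan_out) for fan_in, fan_out in TARGET_SHAPES.values())


trainable = LAYERS * lora_params_per_layer(RANK)
scaling = ALPHA / RANK

print(f"trainable LoRA params: {trainable:,}")  # trainable LoRA params: 36,831,232
print(f"LoRA scaling (alpha / r): {scaling}")   # LoRA scaling (alpha / r): 2.0
```

Under these assumptions the adapters come to roughly 37M parameters, well under 1% of the base model.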

	### Post-Training

	After fine-tuning, the LoRA adapter was merged back into base model weights — resulting in a single, self-contained model with no external adapter dependency.
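
Conceptually, merging a LoRA adapter adds the scaled low-rank product to each targeted weight matrix: `W_merged = W + (alpha / r) * B @ A`. The toy example below shows this on a 2×2 matrix with plain lists. It is a sketch of the standard LoRA merge formula, not the exact code used for AstraGPT, and `matmul` and `merge_lora` are hypothetical helper names.

```python
def matmul(a, b):
    """Multiply two row-major matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with a rank-1 adapter.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[1.0, 2.0]]    # shape (rank, in_features)
lora_b = [[0.5], [0.0]]  # shape (out_features, rank)

merged = merge_lora(weight, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

After the merge, inference needs only the merged matrix, which is why no adapter files are required at load time.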

	---

	## 🤔 Thinking / Reasoning Support

	AstraGPT-7B natively generates `<think>` tag reasoning when triggered. This behavior was learned from the fine-tuning dataset, which used structured chain-of-thought formatting.

	Example:

	Input:
	```
	What is 15 * 47?
	```

	Output:
	```
	<think>
	The multiplication involves multiplying 15 by 47.
	15 × 47 = 15 × 40 + 15 × 7
	= 600 + 105
	= 705
	</think>
	705
	```

	Trigger thinking mode:
	```python
	# Append "<think>\n" to the rendered chat prompt to force reasoning
	prompt = tokenizer.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	) + "<think>\n"
	```
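
When consuming the output programmatically, it helps to separate the reasoning from the final answer. The helper below is a minimal sketch. The name `split_reasoning` and its `prompt_ended_with_think` flag are hypothetical, and it assumes the completion contains at most one reasoning block.

```python
import re

THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(completion, prompt_ended_with_think=True):
    """Split generated text into (reasoning, answer)."""
    text = completion
    # If the prompt already ended with "<think>\n", the opening tag is not part
    # of the completion, so add it back before matching.
    if prompt_ended_with_think and not text.lstrip().startswith("<think>"):
        text = "<think>\n" + text
    match = THINK_PATTERN.search(text)
    if match is None:
        # No closed reasoning block (for example, generation was truncated).
        return "", completion.strip()
    return match.group(1).strip(), text[match.end():].strip()


reasoning, answer = split_reasoning("15 × 47 = 600 + 105 = 705\n</think>\n705")
print(reasoning)  # 15 × 47 = 600 + 105 = 705
print(answer)     # 705
```

If `</think>` is registered as a special token, decoding with `skip_special_tokens=True` may strip it from the text. In that case, decode with `skip_special_tokens=False` before parsing.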

	---

	## ⚡ Quick Start

	### Install

	```bash
	pip install transformers torch bitsandbytes accelerate
	```

	### Basic Inference

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	model_id = "adityawakharkar/AstraGPT-7B"

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.float16,
	    device_map="auto",
	)

	messages = [
	    {
	        "role": "system",
	        "content": "You are AstraGPT, a helpful coding AI built by Tantra AI Labs. Think carefully using <think>...</think> tags before answering.",
	    },
	    {
	        "role": "user",
	        "content": "Write a Python function to reverse a linked list.",
	    },
	]

	prompt = tokenizer.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	) + "<think>\n"  # ← triggers reasoning

	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

	with torch.no_grad():
	    output = model.generate(
	        **inputs,
	        max_new_tokens=1024,
	        temperature=0.3,
	        do_sample=True,
	        repetition_penalty=1.1,
	        pad_token_id=tokenizer.eos_token_id,
	    )

	# Decode only the newly generated tokens, skipping the prompt
	response = tokenizer.decode(
	    output[0][inputs["input_ids"].shape[1]:],
	    skip_special_tokens=True,
	)
	print(response)
	```

	### 4-bit Quantized (Runs on ~6GB VRAM)

	```python
	import torch
	from transformers import AutoModelForCausalLM, BitsAndBytesConfig

	# NF4 4-bit quantization with double quantization; compute runs in float16
	bnb = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_quant_type="nf4",
	    bnb_4bit_compute_dtype=torch.float16,
	    bnb_4bit_use_double_quant=True,
	)

	model = AutoModelForCausalLM.from_pretrained(
	    "adityawakharkar/AstraGPT-7B",
	    quantization_config=bnb,
	    device_map="auto",
	)
	```
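
A rough back-of-the-envelope estimate shows why 4-bit loading fits on small GPUs. The sketch counts weight storage only, using the ~7.0B total from the parameter table. Quantization constants, activations, the KV cache, and the CUDA context add to this, so real usage is higher. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory for model weights only, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


PARAMS = 7.0e9  # approximate total parameter count

fp16 = weight_memory_gib(PARAMS, 16)
nf4 = weight_memory_gib(PARAMS, 4)

print(f"FP16 weights: {fp16:.1f} GiB")  # FP16 weights: 13.0 GiB
print(f"NF4 weights:  {nf4:.1f} GiB")   # NF4 weights:  3.3 GiB
```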

	---

	## 📁 Codebase

	The full from-scratch implementation is open-source:

	```
	AstraGPT-7B-scratch/
	├── model/
	│   ├── config.py            ← AstraGPTConfig (7B hyperparams, 1B/3B presets)
	│   ├── rotary_embedding.py  ← RoPE from scratch (precompute + apply)
	│   ├── attention.py         ← GQA from scratch (32Q / 8KV + KV cache)
	│   ├── feedforward.py       ← SwiGLU + RMSNorm + TransformerBlock
	│   └── transformer.py       ← Full model + generate() + save/load
	├── tokenizer/
	│   ├── bpe_tokenizer.py     ← Full BPE tokenizer (train, encode, decode)
	│   └── train_tokenizer.py   ← Train on any text corpus
	└── training/
	    └── sft_trainer.py       ← Complete SFT loop (grad accum, bf16, cosine LR)
	```

	---

	## Bias, Risks, and Limitations

	- Hallucination: Can produce confident but incorrect answers — always verify
	- Math limits: Complex multi-step math may fail — 7B is a small model
	- English-primary: Best performance in English
	- Reasoning trigger: `<think>` tags work most reliably with an explicit `<think>\n` prefix appended to the prompt

	---

	## Environmental Impact

	- Hardware: 2× NVIDIA RTX 4090 (48GB combined VRAM)
	- Infrastructure: Private bare-metal VPS
	- Training Duration: ~3–4 hours
	- Carbon Emitted: Estimated ~2–3 kgCO2eq

	---

	## Citation

	```bibtex
	@misc{astragpt7b2026,
	author = {Aditya Wakharkar},
	title = {AstraGPT-7B: A 7B LLM Built From Scratch with Chain-of-Thought Reasoning},
	year = {2026},
	publisher = {HuggingFace},
	organization = {Tantra AI Labs},
	url = {https://huggingface.co/adityawakharkar/AstraGPT-7B},
	note = {Custom architecture, custom BPE tokenizer, trained on 2× RTX 4090}
	}
	```

	---

	## Model Card Authors

	Aditya Wakharkar: [@adityawakharkar](https://huggingface.co/adityawakharkar) | [GitHub @codewith-aditya](https://github.com/codewith-aditya)

	## Contact

	- 🐙 GitHub: [github.com/codewith-aditya](https://github.com/codewith-aditya)
	- 🤗 HuggingFace: [@adityawakharkar](https://huggingface.co/adityawakharkar)

	---

	<div align="center">
	<em>Built from scratch with ❤️ by <strong>Tantra AI Labs</strong></em><br/>
	<em>Every layer. Every weight. Every line of code.</em>
	</div>