---
base_model: adityawakharkar/AstraGPTCoder-7B
language:
- en
license: apache-2.0
tags:
- from-scratch
- custom-architecture
- custom-tokenizer
- reasoning
- chain-of-thought
- think-tags
- coding
- fine-tuned
- lora
- peft
- unsloth
- astragpt
- tantra-ai-labs
- rtx-4090
pipeline_tag: text-generation
library_name: transformers
model_creator: Tantra AI Labs
---
# AstraGPT-7B

<div align="center">

**A 7-Billion Parameter Language Model, Built From Scratch**

*Custom Architecture · Custom BPE Tokenizer · Reasoning Fine-Tuned on Dual RTX 4090*

[Apache 2.0 License](https://opensource.org/licenses/Apache-2.0) · [Hugging Face](https://huggingface.co/adityawakharkar/AstraGPT-7B) · [NVIDIA](https://www.nvidia.com) · [GitHub](https://github.com/codewith-aditya)

Built by **Aditya Wakharkar** | [Tantra AI Labs](https://github.com/codewith-aditya)

</div>

---
## What is AstraGPT-7B?

AstraGPT-7B is a **7-billion parameter decoder-only language model** designed for coding and chain-of-thought reasoning.

Unlike most open-source fine-tunes, **every core component of AstraGPT was designed and implemented from scratch in PyTorch**, including the transformer architecture, the BPE tokenizer, and the supervised fine-tuning pipeline.

The model was then **fine-tuned on a reasoning dataset** using LoRA on a **private VPS equipped with dual NVIDIA RTX 4090 GPUs**, giving it native support for `<think>...</think>`-style reasoning output.

> *"Most people fine-tune models. We built one."*

---
## Built From Scratch: Architecture Overview

Every layer of AstraGPT-7B was implemented from first principles in PyTorch. No `AutoModel`, no copy-paste: pure custom code.

```
Input Token IDs
       │
       ▼
Token Embedding  [64,000 → 4,096]
       │
       ▼  ×32 Transformer Blocks
┌─────────────────────────────────────┐
│  AstraGPT Block                     │
│                                     │
│  RMSNorm (Pre-norm)                 │
│  → Grouped Query Attention (GQA)    │
│    · 32 Query Heads                 │
│    · 8 Key-Value Heads              │
│    · RoPE (θ = 1,000,000)           │
│    · KV Cache for inference         │
│  → Residual Add                     │
│                                     │
│  RMSNorm (Pre-norm)                 │
│  → SwiGLU Feed-Forward Network      │
│    · gate_proj, up_proj, down_proj  │
│    · intermediate_size = 11,008     │
│  → Residual Add                     │
└─────────────────────────────────────┘
       │
       ▼
Final RMSNorm
       │
       ▼
LM Head  [4,096 → 64,000]
       │
       ▼
Logits → Next Token
```
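The RoPE step in the diagram can be illustrated with a small pure-Python sketch. This is not the project's `rotary_embedding.py`: the function name `apply_rope` and the interleaved (even, odd) pairing of dimensions are assumptions made for illustration, and only θ = 1,000,000 is taken from the diagram.

```python
import math

def apply_rope(vec, position, theta=1_000_000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even head dimension"
    out = [0.0] * dim
    for i in range(0, dim, 2):
        # Early pairs rotate quickly with position, later pairs slowly.
        angle = position * theta ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * cos_a - vec[i + 1] * sin_a
        out[i + 1] = vec[i] * sin_a + vec[i + 1] * cos_a
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # toy query vector (hypothetical values)
k = [1.0, 0.4, -0.6, 0.9]   # toy key vector (hypothetical values)

# Both pairs of positions are 2 apart, so the attention scores should match.
score_near = dot(apply_rope(q, 5), apply_rope(k, 3))
score_far = dot(apply_rope(q, 105), apply_rope(k, 103))
print(abs(score_near - score_far) < 1e-9)
```

Because each pair is only rotated, vector norms are preserved and the query-key score depends on the position difference rather than on absolute positions.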
### Architecture Highlights

| Component | Implementation | Why |
|-----------|---------------|-----|
| **Grouped Query Attention (GQA)** | 32Q / 8KV heads, built from scratch | 4× less KV memory vs. MHA; the same scheme is used in LLaMA-3 and Mistral |
| **Rotary Position Embeddings (RoPE)** | Full RoPE math from scratch, θ=1M | Better long-context behavior than learned position embeddings |
| **SwiGLU FFN** | SiLU(gate) × up, through down_proj | Outperforms GELU/ReLU on LM benchmarks |
| **RMSNorm** | Pre-norm, no bias, no mean subtraction | ~30% faster than LayerNorm |
| **Flash Attention** | PyTorch 2.0 `scaled_dot_product_attention` | Memory-efficient attention with O(n) space |
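The "4× less KV memory" figure follows from the head counts. A rough sketch of the arithmetic, assuming a head dimension of 128 (4,096 / 32) and 2 bytes per cached value (FP16 or BF16); the helper name `kv_cache_bytes` is hypothetical:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    # Two tensors (K and V) per layer, each of shape [seq_len, n_kv_heads, head_dim].
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value

gqa = kv_cache_bytes(seq_len=2048, n_kv_heads=8)    # 8 KV heads, as in the table
mha = kv_cache_bytes(seq_len=2048, n_kv_heads=32)   # one KV head per query head
print(f"GQA: {gqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB, ratio: {mha // gqa}x")
```

The ratio is independent of sequence length, since the cache grows linearly with it in both cases.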
### Parameter Count (~7B)

| Component | Parameters |
|-----------|-----------|
| Token Embedding (64K × 4096) | ~262M |
| Attention × 32 layers | ~2.15B |
| SwiGLU FFN × 32 layers | ~4.32B |
| RMSNorm × 65 | ~266K |
| LM Head | ~262M |
| **Total** | **~7.0B** |
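The rows in this table can be re-derived from the stated dimensions. The sketch below assumes bias-free projections and untied input/output embeddings, neither of which the card states explicitly. It computes the attention total for both 32 and 8 key-value heads: the ~2.15B row matches the 32-KV-head case, while the same formula with 8 KV heads gives about 1.34B and a total near 6.2B.

```python
vocab, hidden, layers = 64_000, 4_096, 32
n_heads, n_kv_heads, ffn = 32, 8, 11_008
head_dim = hidden // n_heads  # 128

def attention_params(kv_heads):
    q_and_o = 2 * hidden * hidden               # q_proj and o_proj
    k_and_v = 2 * hidden * kv_heads * head_dim  # k_proj and v_proj
    return layers * (q_and_o + k_and_v)

embedding = vocab * hidden
lm_head = vocab * hidden            # assumption: not tied to the embedding
swiglu = layers * 3 * hidden * ffn  # gate_proj, up_proj, down_proj
norms = (2 * layers + 1) * hidden   # two per block plus the final norm

attn_mha = attention_params(n_heads)
attn_gqa = attention_params(n_kv_heads)
total_mha = embedding + attn_mha + swiglu + norms + lm_head
total_gqa = embedding + attn_gqa + swiglu + norms + lm_head

print(f"attention, 32 KV heads: {attn_mha / 1e9:.2f}B -> total {total_mha / 1e9:.2f}B")
print(f"attention,  8 KV heads: {attn_gqa / 1e9:.2f}B -> total {total_gqa / 1e9:.2f}B")
```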
---

## Custom BPE Tokenizer: From Scratch

AstraGPT uses a **custom Byte Pair Encoding tokenizer** built entirely from scratch, with no SentencePiece and no Hugging Face `tokenizers` library.

```python
# Built from scratch
from tokenizer import BPETokenizer

tok = BPETokenizer(vocab_size=64_000)
# Use a context manager so the corpus file is closed after training
with open("corpus.txt", encoding="utf-8") as corpus:
    tok.train(corpus, num_merges=60_000)
```

**Tokenizer features:**

- **Byte-level base vocabulary**: 256 raw bytes, handles any Unicode
- **GPT-4 style pre-tokenization regex**: smart word boundary splitting
- **64,000 vocab size**: 60K BPE merges on top of byte base
- **Built-in special tokens:** `<think>`, `</think>`, `<|im_start|>`, `<|im_end|>`, BOS, EOS, PAD
- **`apply_chat_template()`**: custom chat format support
- **Save/load**: JSON-serializable merge rules
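To make the byte-level merge idea concrete, here is a minimal sketch of BPE training, encoding, and decoding. It is not the project's `BPETokenizer`: the function names are hypothetical, and it omits the pre-tokenization regex and special tokens listed above.

```python
from collections import Counter

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules on top of the 256 raw byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in learning order
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges would not compress
        new_id = 256 + step
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges

def encode(text, merges):
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # apply rules in the order they were learned
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

sample = "low lower lowest low low"
rules = train_bpe(sample, num_merges=10)
tokens = encode(sample, rules)
print(len(sample.encode("utf-8")), "bytes ->", len(tokens), "tokens")
print(decode(tokens, rules) == sample)
```

Storing the merges as an ordered mapping is also what makes the rules easy to serialize, in the spirit of the JSON save/load feature above.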
---

## Training: Dual RTX 4090 on Private VPS

Fine-tuning was performed on a **private Linux VPS with 2× NVIDIA RTX 4090 GPUs** (48 GB VRAM in total).

### Hardware Setup

| Spec | Value |
|------|-------|
| GPUs | **2× NVIDIA RTX 4090** (24GB VRAM each) |
| Total VRAM | **48 GB** |
| CPU | High-core count server CPU |
| Infrastructure | Private VPS (bare metal) |
| OS | Ubuntu 22.04 LTS |
| CUDA | 12.x |
### Training Pipeline: Also Built From Scratch

The SFT (Supervised Fine-Tuning) training loop was implemented from scratch with production-grade features:

```python
# Full custom training loop.
# SFTTrainer here is the project's custom trainer (see training/sft_trainer.py
# under Codebase); `model`, `tokenizer`, and `dataset` are created beforehand.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    # Dual GPU via DDP
    use_bf16=True,
    grad_accumulation=8,
    learning_rate=2e-4,
    use_wandb=True,
)
trainer.train()
```

**Training loop features:**

- **Gradient accumulation**: effective large-batch training
- **Mixed precision (BF16)**: full RTX 4090 tensor core utilization
- **Cosine LR schedule with warmup**: smooth convergence
- **Gradient clipping**: stable training
- **W&B logging**: real-time loss/LR tracking
- **Checkpoint saving**: best model tracking by loss
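As an illustration of the cosine schedule with warmup, the sketch below computes the learning rate at a given optimizer step. The function name `lr_at_step`, the linear warmup shape, and the decay to zero are assumptions; the peak rate of 2e-4 and the 5% warmup ratio come from the hyperparameters reported for this model.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.05, min_lr=0.0):
    """Linear warmup to `peak_lr`, then cosine decay to `min_lr`."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 49, 50, 525, 1000):
    print(step, f"{lr_at_step(step, total_steps=1000):.2e}")
```

With gradient accumulation, `step` would count optimizer updates rather than micro-batches, so the schedule advances once every 8 forward/backward passes under the settings above.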
### Fine-Tuning Hyperparameters

| Parameter | Value |
|-----------|-------|
| Method | LoRA (PEFT) via Unsloth |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max Sequence Length | 2,048 tokens |
| Effective Batch Size | 16 (2 × grad_accum 8) |
| Learning Rate | 2e-4 |
| LR Scheduler | Cosine with warmup |
| Warmup Ratio | 5% |
| Epochs | 3 |
| Precision | BF16 mixed precision |
| Optimizer | AdamW 8-bit |
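For a sense of scale, the number of trainable LoRA parameters implied by these settings can be estimated as rank × (in_features + out_features) per target module. The sketch assumes the layer shapes from the architecture section (hidden size 4,096, 8 KV heads of dimension 128, intermediate size 11,008); the actual adapter may differ if the shapes do.

```python
hidden, ffn, layers, rank, alpha = 4_096, 11_008, 32, 16, 32
kv_dim = 8 * (hidden // 32)  # 8 KV heads × head_dim 128 (assumed from the GQA description)

# (in_features, out_features) for each LoRA target module
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, ffn),
    "up_proj": (hidden, ffn),
    "down_proj": (ffn, hidden),
}

# Each adapted module adds A (rank × in) and B (out × rank).
per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
trainable = layers * per_layer
scaling = alpha / rank  # factor applied to the low-rank update

print(f"{trainable / 1e6:.1f}M trainable parameters, update scaled by {scaling}")
```

Under these assumptions the adapter is well under 1% of the base model's parameter count.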
### Post-Training

After fine-tuning, the LoRA adapter was **merged back into the base model weights**, resulting in a single, self-contained model with no external adapter dependency.
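Assuming the standard LoRA formulation (the card does not spell out the merge step), merging folds each low-rank update into its frozen weight matrix. With the rank $r = 16$ and $\alpha = 32$ listed above, the scale factor is 2:

$$
W_{\text{merged}} = W + \frac{\alpha}{r}\, B A = W + 2\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad A \in \mathbb{R}^{r \times d_{\text{in}}}
$$

After this step the adapter matrices $A$ and $B$ are no longer needed at inference time.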
---

## Thinking / Reasoning Support

AstraGPT-7B generates `<think>`-tagged reasoning when triggered. This behavior was trained in through the fine-tuning dataset, which used structured chain-of-thought formatting.

**Example:**

**Input:**

```
What is 15 * 47?
```

**Output:**

```
<think>
The multiplication involves multiplying 15 by 47.
15 × 47 = 15 × 40 + 15 × 7
        = 600 + 105
        = 705
</think>
705
```

**Trigger thinking mode:**

```python
# Append "<think>\n" to the rendered prompt to force reasoning
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) + "<think>\n"
```
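Downstream code usually wants the reasoning and the final answer separately. A minimal sketch, assuming the output format shown in the example above; `split_reasoning` is a hypothetical helper, not part of the model's tooling. It also handles the case where the prompt already ended with `<think>\n`, so the generated text contains only the closing tag.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (reasoning, answer); reasoning is empty if no think block is found."""
    if "<think>" not in text and "</think>" in text:
        # The opening tag was part of the prompt, so restore it before matching.
        text = "<think>" + text
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

generated = "15 × 47 = 600 + 105 = 705\n</think>\n705"
reasoning, answer = split_reasoning(generated)
print("reasoning:", reasoning)
print("answer:", answer)
```

If the tags are registered as special tokens, decoding with `skip_special_tokens=True` may strip them, in which case the split has to happen on token IDs or with special tokens kept in the decoded text.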
---

## Quick Start

### Install

```bash
pip install transformers torch bitsandbytes accelerate
```

### Basic Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "adityawakharkar/AstraGPT-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": "You are AstraGPT, a helpful coding AI built by Tantra AI Labs. Think carefully using <think>...</think> tags before answering.",
    },
    {
        "role": "user",
        "content": "Write a Python function to reverse a linked list.",
    },
]

# Render the chat template as text, then open a <think> block to trigger reasoning
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) + "<think>\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.3,
        do_sample=True,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
### 4-bit Quantized (Runs on ~6GB VRAM)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights with double quantization; matmuls are computed in FP16
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "adityawakharkar/AstraGPT-7B",
    quantization_config=bnb,
    device_map="auto",
)
```
---

## Codebase

The full from-scratch implementation is open-source:

```
AstraGPT-7B-scratch/
├── model/
│   ├── config.py            ← AstraGPTConfig (7B hyperparams, 1B/3B presets)
│   ├── rotary_embedding.py  ← RoPE from scratch (precompute + apply)
│   ├── attention.py         ← GQA from scratch (32Q / 8KV + KV cache)
│   ├── feedforward.py       ← SwiGLU + RMSNorm + TransformerBlock
│   └── transformer.py       ← Full model + generate() + save/load
├── tokenizer/
│   ├── bpe_tokenizer.py     ← Full BPE tokenizer (train, encode, decode)
│   └── train_tokenizer.py   ← Train on any text corpus
└── training/
    └── sft_trainer.py       ← Complete SFT loop (grad accum, bf16, cosine LR)
```
---

## Bias, Risks, and Limitations

- **Hallucination:** Can produce confident but incorrect answers; always verify
- **Math limits:** Complex multi-step math may fail; 7B is a small model
- **English-primary:** Best performance in English
- **Reasoning trigger:** `<think>` tags work most reliably with an explicit `<think>\n` prefix in the prompt

---

## Environmental Impact

- **Hardware:** 2× NVIDIA RTX 4090 (48GB combined VRAM)
- **Infrastructure:** Private bare-metal VPS
- **Training Duration:** ~3–4 hours
- **Carbon Emitted:** Estimated ~2–3 kgCO2eq

---
## Citation

```bibtex
@misc{astragpt7b2026,
  author       = {Aditya Wakharkar},
  title        = {AstraGPT-7B: A 7B LLM Built From Scratch with Chain-of-Thought Reasoning},
  year         = {2026},
  publisher    = {HuggingFace},
  organization = {Tantra AI Labs},
  url          = {https://huggingface.co/adityawakharkar/AstraGPT-7B},
  note         = {Custom architecture, custom BPE tokenizer, trained on 2× RTX 4090}
}
```
---

## Model Card Authors

**Aditya Wakharkar**: [@adityawakharkar](https://huggingface.co/adityawakharkar) | [GitHub @codewith-aditya](https://github.com/codewith-aditya)

## Contact

- GitHub: [github.com/codewith-aditya](https://github.com/codewith-aditya)
- HuggingFace: [@adityawakharkar](https://huggingface.co/adityawakharkar)

---

<div align="center">
<em>Built from scratch with ❤️ by <strong>Tantra AI Labs</strong></em><br/>
<em>Every layer. Every weight. Every line of code.</em>
</div>