# SanskritGPT-Itihasa

A GPT-2 Transformer trained from scratch on the Devanagari texts of the Mahabharata and Ramayana.

## Model Summary

SanskritGPT-Itihasa is a decoder-only Transformer (GPT-2 architecture) trained from scratch on the classical Sanskrit epics (Itihasa). Using style-control tokens, the model generates Devanagari Sanskrit verse in the stylistic register of either the Mahabharata or the Ramayana.

This is a computational-linguistics research experiment, not an authoritative source of scripture. The model captures statistical patterns of classical Sanskrit poetry, including metrical adherence (Anushtubh), sandhi rules, and epic vocabulary.
## 🔗 Links
| Resource | Link |
|---|---|
| Live Demo (Gradio App) | spaces/Dhruvil8/SanskritGPT-Itihasa |
| Source Code & Notebooks | github.com/Dhruvil-8/SanskritGPT-Itihasa |
| Training Notebook (Colab) | epic_model_training.ipynb |
## Model Specifications
| Attribute | Value |
|---|---|
| Architecture | GPT-2 (Decoder-only Transformer) |
| Parameters | ~42 Million |
| Layers | 8 |
| Attention Heads | 8 |
| Embedding Dimension | 512 |
| Context Window | 512 Tokens |
| Tokenizer | Unigram (Metaspace) — Devanagari-native |
| Weight Format | Safetensors (~160 MB) |
| Training Hardware | Google Colab T4 GPU |
| Training Epochs | 30 |
| Framework | PyTorch + Hugging Face Transformers |
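The ~42M parameter figure is consistent with the table above. A back-of-the-envelope check, assuming a standard GPT-2 block with tied input/output embeddings and the usual bias terms:

```python
# Parameter count for a GPT-2-style model with the card's specs
vocab, d, n_layer, n_ctx = 32_000, 512, 8, 512

tok_emb = vocab * d          # token embeddings (tied with the LM head)
pos_emb = n_ctx * d          # learned position embeddings
per_layer = (
    4 * d * d + 4 * d        # attention: fused QKV + output projection (+ biases)
    + 8 * d * d + 5 * d      # MLP: d -> 4d and 4d -> d (+ biases)
    + 4 * d                  # two LayerNorms (scale + shift)
)
total = tok_emb + pos_emb + n_layer * per_layer + 2 * d  # + final LayerNorm
print(f"~{total / 1e6:.1f}M parameters")  # ≈ 41.9M, i.e. the ~42M reported
```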
## Quick Start

### Installation

```bash
pip install transformers torch
```
### Load and Generate

```python
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

# Load model and tokenizer from the Hugging Face Hub
model_name = "Dhruvil8/SanskritGPT-Itihasa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()


def generate_sanskrit(prompt, style="classical", max_length=200, temperature=0.8):
    """Generate Sanskrit epic verse.

    Args:
        prompt: Starting words in Devanagari.
        style: "mahabharata", "ramayana", or "classical".
        max_length: Maximum total length in tokens.
        temperature: Sampling temperature (0.1 = focused, 1.5 = creative).
    """
    # Apply style-control tokens
    if style == "mahabharata":
        full_prompt = f"<MBH> {prompt}"
    elif style == "ramayana":
        full_prompt = f"<RAM> {prompt}"
    else:
        full_prompt = prompt

    inputs = tokenizer(full_prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.95,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage
output = generate_sanskrit("धर्मक्षेत्रे कुरुक्षेत्रे", style="mahabharata")
print(output)
```
### Expected Output (Example)

```
धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः ।
मामकाः पाण्डवाश्चैव किमकुर्वत संजय ॥
```
## Training Details

### Dataset

- Source: [bombay.indology.info](https://bombay.indology.info/)
- Content: Electronic texts of all 18 Parvas of the Mahabharata and all 7 Kandas of the Ramayana.
- Language: Classical Sanskrit in Devanagari script.
- Corpus Size: ~24 MB of cleaned, processed Devanagari text.
### Training Procedure

- Tokenizer: Custom Unigram tokenizer with Metaspace pre-tokenization, trained specifically on this Sanskrit corpus with a 32,000-token vocabulary.
- Style Tokens: `<MBH>` (Mahabharata), `<RAM>` (Ramayana), `<eos>` (end of text).
- Optimizer: AdamW with weight decay 0.1.
- Learning Rate: Cosine schedule with warmup.
- Sequence Length: 512 tokens.
- Hardware: Google Colab T4 GPU (16GB VRAM).
- Duration: 30 epochs, ~2 hours total.
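The cosine schedule with warmup listed above can be sketched as a pure function of the training step. The warmup fraction and peak learning rate below are illustrative assumptions, not values from the card:

```python
import math


def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Illustrative values: 10% warmup over a hypothetical 10,000-step run
peak = 3e-4
print(lr_at(0, 10_000, 1_000, peak))       # 0.0 at the first step
print(lr_at(1_000, 10_000, 1_000, peak))   # peak_lr once warmup ends
print(lr_at(10_000, 10_000, 1_000, peak))  # ~0.0 at the end of training
```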
## Evaluation
| Metric | Value |
|---|---|
| Final Training Loss | ~1.2 |
| Final Validation Loss | ~1.4 |
| Perplexity (Validation) | ~4.1 |
| Metrical Adherence | High (Anushtubh meter dominant) |
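The perplexity row follows directly from the validation loss, since perplexity is the exponential of the mean cross-entropy:

```python
import math

val_loss = 1.4                   # final validation loss from the table
perplexity = math.exp(val_loss)  # exp(1.4) ≈ 4.06, i.e. the ~4.1 reported
print(f"{perplexity:.2f}")
```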
## Intended Use

### Appropriate Use Cases
- Computational Linguistics Research: Analysis of Sanskrit metrical and grammatical patterns.
- Digital Humanities: Exploration of AI-generated classical literature.
- Educational Demos: Demonstrating Transformer capabilities on low-resource, non-Latin script languages.
- Creative Writing Aid: Generating starting verses for research or artistic exploration (with human expert review).
### Out-of-Scope Uses
- Religious or Ritual use: Generated text is not authentic scripture.
- Scholarly Translation: The model does not "understand" semantic meaning.
- Authoritative Attribution: Generated text must not be attributed to classical authors.
## Limitations
- Statistical Mimicry: The model learns phonetic and metrical patterns — it does not possess an understanding of traditional Sanskrit semantics or philosophy.
- Occasional Memorization: Due to the finite corpus size, may occasionally reproduce specific training verses.
- Fixed Context: Generation quality may decline for very long sequences beyond 512 tokens.
- Sandhi Imperfection: While the Unigram tokenizer is morphologically aware, complex Sandhi rules are not always perfectly applied.
- Not Scripture: Generated text is not authentic Vedic or Epic scripture and must never be used for ritual, recitation, or canonical scholarly interpretation.
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{sanskritgpt-itihasa-2026,
  author    = {Dhruvil},
  title     = {SanskritGPT-Itihasa: A GPT-2 Transformer for Sanskrit Epic Verse Generation},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Dhruvil8/SanskritGPT-Itihasa},
  note      = {Trained from scratch on Mahabharata and Ramayana Devanagari texts.}
}
```
## License
This project is released under the MIT License. The underlying Sanskrit texts are in the public domain.
This model is an AI-assisted computational experiment. It reflects the language patterns of the corpus, not the wisdom of the tradition. For authentic scriptural guidance, consult qualified scholars and traditional sources.