SanskritGPT-Itihasa

A GPT-2 Transformer trained from scratch on the Devanagari texts of the Mahabharata and Ramayana.



Model Summary

SanskritGPT-Itihasa is a decoder-only Transformer (GPT-2 architecture) trained from scratch on classical Sanskrit epic texts (Itihasa). The model learns to generate Devanagari Sanskrit verse in the stylistic register of either the Mahabharata or the Ramayana, using style-control tokens.

This is a computational-linguistics research experiment, not an authoritative source of scripture. The model captures statistical patterns of classical Sanskrit poetry, including the Anushtubh meter, sandhi rules, and epic vocabulary.


🔗 Links

  • Live Demo (Gradio App): spaces/Dhruvil8/SanskritGPT-Itihasa
  • Source Code & Notebooks: github.com/Dhruvil-8/SanskritGPT-Itihasa
  • Training Notebook (Colab): epic_model_training.ipynb

Model Specifications

  • Architecture: GPT-2 (decoder-only Transformer)
  • Parameters: ~42 million
  • Layers: 8
  • Attention Heads: 8
  • Embedding Dimension: 512
  • Context Window: 512 tokens
  • Tokenizer: Unigram (Metaspace), Devanagari-native
  • Weight Format: Safetensors (~160 MB)
  • Training Hardware: Google Colab T4 GPU
  • Training Epochs: 30
  • Framework: PyTorch + Hugging Face Transformers
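The parameter count above can be sanity-checked from the architecture numbers alone. The sketch below (plain Python, no libraries) tallies a standard GPT-2 block with tied input/output embeddings; the sizes come from the table, so this is an estimate from the configuration, not a dump of the actual checkpoint.

```python
# Approximate parameter count for the card's GPT-2 configuration:
# vocab 32,000, context 512, embedding dim 512, 8 layers.
vocab, ctx, d, n_layers = 32_000, 512, 512, 8
ff = 4 * d  # GPT-2's feed-forward width is 4x the embedding dim

embeddings = vocab * d + ctx * d          # token + positional embeddings
per_layer = (
    2 * 2 * d                             # two LayerNorms (weight + bias)
    + d * 3 * d + 3 * d                   # fused QKV projection
    + d * d + d                           # attention output projection
    + d * ff + ff                         # MLP up-projection
    + ff * d + d                          # MLP down-projection
)
final_ln = 2 * d                          # final LayerNorm

total = embeddings + n_layers * per_layer + final_ln
print(f"{total / 1e6:.1f}M parameters")   # ~41.9M, matching the card
```

The output-projection (LM head) weights are tied to the token embedding, so they add no extra parameters; that is the GPT-2 default and is what makes the total land near 42M rather than 58M.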

Quick Start

Installation

pip install transformers torch

Load and Generate

from transformers import AutoTokenizer, GPT2LMHeadModel
import torch

# Load model and tokenizer from Hugging Face Hub
model_name = "Dhruvil8/SanskritGPT-Itihasa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()

def generate_sanskrit(prompt, style="classical", max_length=200, temperature=0.8):
    """
    Generate Sanskrit epic verse.

    Args:
        prompt: Starting words in Devanagari.
        style: "mahabharata", "ramayana", or "classical"
        max_length: Maximum tokens to generate.
        temperature: Sampling temperature (0.1 = focused, 1.5 = creative).
    """
    # Apply style-control tokens
    if style == "mahabharata":
        full_prompt = f"<MBH> {prompt}"
    elif style == "ramayana":
        full_prompt = f"<RAM> {prompt}"
    else:
        full_prompt = prompt

    inputs = tokenizer(full_prompt, return_tensors="pt")

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,  # pass input_ids and attention_mask together
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.95,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )

    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example usage
output = generate_sanskrit("धर्मक्षेत्रे कुरुक्षेत्रे", style="mahabharata")
print(output)

Expected Output (Example)

धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः ।
मामकाः पाण्डवाश्चैव किमकुर्वत संजय ॥
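The sample shloka above is in Anushtubh meter: four padas of eight syllables each, 32 in total. A rough way to check this programmatically is to count aksharas (syllable nuclei) in the Devanagari text. The counter below is a simplified sketch written for this card, not part of the repository; it handles matras, viramas, and inherent vowels, but not every edge case of the script.

```python
# Simplified Devanagari akshara (syllable) counter: counts independent
# vowels, dependent vowel signs (matras), and consonants that keep their
# inherent 'a' (i.e. are not followed by a matra or a virama).
MATRAS = set(range(0x093E, 0x094D))        # dependent vowel signs ा .. ौ
VIRAMA = 0x094D                            # ्
INDEP_VOWELS = set(range(0x0904, 0x0915))  # independent vowels ऄ .. औ
CONSONANTS = set(range(0x0915, 0x093A))    # consonants क .. ह

def count_aksharas(text: str) -> int:
    chars = [c for c in text if not c.isspace()]
    count = 0
    for i, c in enumerate(chars):
        cp = ord(c)
        if cp in INDEP_VOWELS or cp in MATRAS:
            count += 1
        elif cp in CONSONANTS:
            nxt = ord(chars[i + 1]) if i + 1 < len(chars) else None
            if nxt != VIRAMA and nxt not in MATRAS:
                count += 1
    return count

# Each line of the sample shloka has two 8-syllable padas.
print(count_aksharas("धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः"))  # 16
```

Dandas (। ॥), visarga, and anusvara fall outside all three code-point ranges, so they are skipped without affecting the count.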

Training Details

Dataset

  • Source: bombay.indology.info (https://bombay.indology.info/)
  • Content: Electronic texts of all 18 Parvas of the Mahabharata and all 7 Kandas of the Ramayana.
  • Language: Classical Sanskrit in Devanagari script.
  • Corpus Size: ~24 MB of cleaned, processed Devanagari text.

Training Procedure

  • Tokenizer: Custom Unigram tokenizer with Metaspace pre-tokenization, trained specifically on this Sanskrit corpus with a 32,000 token vocabulary.
  • Style Tokens: <MBH> (Mahabharata), <RAM> (Ramayana), <eos> (End of text).
  • Optimizer: AdamW with weight decay 0.1.
  • Learning Rate: Cosine schedule with warmup.
  • Sequence Length: 512 tokens.
  • Hardware: Google Colab T4 GPU (16GB VRAM).
  • Duration: 30 epochs, ~2 hours total.
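For reference, a Unigram tokenizer with Metaspace pre-tokenization like the one described above can be trained with the Hugging Face tokenizers library. The sketch below is illustrative only, run on a toy in-memory corpus; the card's actual tokenizer used vocab_size=32000 over the full ~24 MB of epic text.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Unigram model with Metaspace pre-tokenization, as in the model card.
tokenizer = Tokenizer(models.Unigram())
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

trainer = trainers.UnigramTrainer(
    vocab_size=500,  # the card uses 32,000 on the full corpus
    special_tokens=["<MBH>", "<RAM>", "<eos>"],
    unk_token="<unk>",
)

# Toy stand-in for the cleaned Devanagari corpus files.
corpus = ["धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः"] * 100
tokenizer.train_from_iterator(corpus, trainer)

# The style-control tokens end up in the vocabulary as special tokens.
print(tokenizer.token_to_id("<MBH>") is not None)
```

Registering `<MBH>` and `<RAM>` as special tokens keeps them as single atomic tokens during encoding, which is what makes them usable as style-control prefixes at generation time.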

Evaluation

  • Final Training Loss: ~1.2
  • Final Validation Loss: ~1.4
  • Validation Perplexity: ~4.1
  • Metrical Adherence: high (Anushtubh meter dominant)
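The perplexity figure is consistent with the reported validation loss, since perplexity for a language model is simply the exponential of the mean cross-entropy loss:

```python
import math

val_loss = 1.4                   # reported final validation loss (nats/token)
perplexity = math.exp(val_loss)  # perplexity = e^loss for LM cross-entropy
print(f"{perplexity:.2f}")       # 4.06, matching the reported ~4.1
```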

Intended Use

Appropriate Use Cases

  • Computational Linguistics Research: Analysis of Sanskrit metrical and grammatical patterns.
  • Digital Humanities: Exploration of AI-generated classical literature.
  • Educational Demos: Demonstrating Transformer capabilities on low-resource, non-Latin script languages.
  • Creative Writing Aid: Generating starting verses for research or artistic exploration (with human expert review).

Out-of-Scope Uses

  • Religious or Ritual use: Generated text is not authentic scripture.
  • Scholarly Translation: The model does not "understand" semantic meaning.
  • Authoritative Attribution: Generated text must not be attributed to classical authors.

Limitations

  • Statistical Mimicry: The model learns phonetic and metrical patterns — it does not possess an understanding of traditional Sanskrit semantics or philosophy.
  • Occasional Memorization: Due to the finite corpus size, may occasionally reproduce specific training verses.
  • Fixed Context: Generation quality may decline for very long sequences beyond 512 tokens.
  • Sandhi Imperfection: While the Unigram tokenizer is morphologically aware, complex Sandhi rules are not always perfectly applied.
  • Not Scripture: Generated text is not authentic Vedic or Epic scripture and must never be used for ritual, recitation, or canonical scholarly interpretation.
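One practical way to screen for the memorization issue noted above is a verbatim window search of generated text against the training corpus. The helper below is a hypothetical sketch written for this card (the function name and the 20-character window are illustrative choices, not part of this repo):

```python
def appears_in_corpus(generated: str, corpus_text: str, window: int = 20) -> bool:
    """Return True if any `window`-character span of the generated text
    occurs verbatim in the corpus (whitespace-normalized)."""
    g = " ".join(generated.split())
    c = " ".join(corpus_text.split())
    if len(g) < window:
        return g in c
    return any(g[i:i + window] in c for i in range(len(g) - window + 1))

# Flag exact reproductions before presenting generated verse as novel.
print(appears_in_corpus("abc def", "xyz abc def uvw", window=5))  # True
```

A match does not prove memorization of a full verse, but it is a cheap first filter before any scholarly or creative use of the output.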

Citation

If you use this model in your research, please cite:

@misc{sanskritgpt-itihasa-2026,
  author       = {Dhruvil},
  title        = {SanskritGPT-Itihasa: A GPT-2 Transformer for Sanskrit Epic Verse Generation},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/Dhruvil8/SanskritGPT-Itihasa},
  note         = {Trained from scratch on Mahabharata and Ramayana Devanagari texts.}
}

License

This project is released under the MIT License. The underlying Sanskrit texts are in the public domain.


This model is an AI-assisted computational experiment. It reflects the language patterns of the corpus, not the wisdom of the tradition. For authentic scriptural guidance, consult qualified scholars and traditional sources.
