# SanskritGPT-Itihasa

A GPT-2 Transformer trained from scratch on the Devanagari texts of the Mahabharata and Ramayana.

## Model Summary

SanskritGPT-Itihasa is a decoder-only Transformer (GPT-2 architecture) trained from scratch on the classical Sanskrit epics (Itihasa). Using style-control tokens, the model generates Devanagari Sanskrit verse in the stylistic register of either the Mahabharata or the Ramayana.

This is a computational-linguistics research experiment, not an authoritative source of scripture. The model captures statistical patterns of classical Sanskrit poetry, including metrical adherence (Anushtubh), sandhi rules, and epic vocabulary.
## 🔗 Links
| Resource | Link |
|---|---|
| Live Demo (Gradio App) | spaces/Dhruvil8/SanskritGPT-Itihasa |
| Source Code & Notebooks | github.com/Dhruvil-8/SanskritGPT-Itihasa |
| Training Notebook (Colab) | epic_model_training.ipynb |
## Model Specifications
| Attribute | Value |
|---|---|
| Architecture | GPT-2 (Decoder-only Transformer) |
| Parameters | ~42 Million |
| Layers | 8 |
| Attention Heads | 8 |
| Embedding Dimension | 512 |
| Context Window | 512 Tokens |
| Tokenizer | Unigram (Metaspace) — Devanagari-native |
| Weight Format | Safetensors (~160 MB) |
| Training Hardware | Google Colab T4 GPU |
| Training Epochs | 30 |
| Framework | PyTorch + Hugging Face Transformers |
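The ~42M parameter figure is consistent with the table above. A back-of-the-envelope check, assuming a standard GPT-2 block with tied input/output embeddings and the usual bias terms:

```python
# Parameter count for a GPT-2-style model with the card's specs
vocab, d, n_layer, n_ctx = 32_000, 512, 8, 512

tok_emb = vocab * d          # token embeddings (tied with the LM head)
pos_emb = n_ctx * d          # learned position embeddings
per_layer = (
    4 * d * d + 4 * d        # attention: fused QKV + output projection (+ biases)
    + 8 * d * d + 5 * d      # MLP: d -> 4d and 4d -> d (+ biases)
    + 4 * d                  # two LayerNorms (scale + shift)
)
total = tok_emb + pos_emb + n_layer * per_layer + 2 * d  # + final LayerNorm
print(f"~{total / 1e6:.1f}M parameters")  # ≈ 41.9M, i.e. the ~42M reported
```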
## Quick Start

### Installation

```bash
pip install transformers torch
```
### Load and Generate

```python
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

# Load model and tokenizer from the Hugging Face Hub
model_name = "Dhruvil8/SanskritGPT-Itihasa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()


def generate_sanskrit(prompt, style="classical", max_length=200, temperature=0.8):
    """Generate Sanskrit epic verse.

    Args:
        prompt: Starting words in Devanagari.
        style: "mahabharata", "ramayana", or "classical".
        max_length: Maximum total length in tokens.
        temperature: Sampling temperature (0.1 = focused, 1.5 = creative).
    """
    # Apply style-control tokens
    if style == "mahabharata":
        full_prompt = f"<MBH> {prompt}"
    elif style == "ramayana":
        full_prompt = f"<RAM> {prompt}"
    else:
        full_prompt = prompt

    inputs = tokenizer(full_prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.95,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage
output = generate_sanskrit("धर्मक्षेत्रे कुरुक्षेत्रे", style="mahabharata")
print(output)
```
### Expected Output (Example)

```
धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः ।
मामकाः पाण्डवाश्चैव किमकुर्वत संजय ॥
```
## Training Details

### Dataset

- Source: [bombay.indology.info](https://bombay.indology.info/)
- Content: Electronic texts of all 18 Parvas of the Mahabharata and all 7 Kandas of the Ramayana.
- Language: Classical Sanskrit in Devanagari script.
- Corpus Size: ~24 MB of cleaned, processed Devanagari text.
### Training Procedure

- Tokenizer: Custom Unigram tokenizer with Metaspace pre-tokenization, trained specifically on this Sanskrit corpus with a 32,000-token vocabulary.
- Style Tokens: `<MBH>` (Mahabharata), `<RAM>` (Ramayana), `<eos>` (end of text).
- Optimizer: AdamW with weight decay 0.1.
- Learning Rate: Cosine schedule with warmup.
- Sequence Length: 512 tokens.
- Hardware: Google Colab T4 GPU (16GB VRAM).
- Duration: 30 epochs, ~2 hours total.
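The cosine schedule with warmup listed above can be sketched as a pure function of the training step. The warmup fraction and peak learning rate below are illustrative assumptions, not values from the card:

```python
import math


def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Illustrative values: 10% warmup over a hypothetical 10,000-step run
peak = 3e-4
print(lr_at(0, 10_000, 1_000, peak))       # 0.0 at the first step
print(lr_at(1_000, 10_000, 1_000, peak))   # peak_lr once warmup ends
print(lr_at(10_000, 10_000, 1_000, peak))  # ~0.0 at the end of training
```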
## Evaluation
| Metric | Value |
|---|---|
| Final Training Loss | ~1.2 |
| Final Validation Loss | ~1.4 |
| Perplexity (Validation) | ~4.1 |
| Metrical Adherence | High (Anushtubh meter dominant) |
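The perplexity row follows directly from the validation loss, since perplexity is the exponential of the mean cross-entropy:

```python
import math

val_loss = 1.4                   # final validation loss from the table
perplexity = math.exp(val_loss)  # exp(1.4) ≈ 4.06, i.e. the ~4.1 reported
print(f"{perplexity:.2f}")
```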
## Intended Use

### Appropriate Use Cases
- Computational Linguistics Research: Analysis of Sanskrit metrical and grammatical patterns.
- Digital Humanities: Exploration of AI-generated classical literature.
- Educational Demos: Demonstrating Transformer capabilities on low-resource, non-Latin script languages.
- Creative Writing Aid: Generating starting verses for research or artistic exploration (with human expert review).
### Out-of-Scope Uses
- Religious or Ritual use: Generated text is not authentic scripture.
- Scholarly Translation: The model does not "understand" semantic meaning.
- Authoritative Attribution: Generated text must not be attributed to classical authors.
## Limitations
- Statistical Mimicry: The model learns phonetic and metrical patterns — it does not possess an understanding of traditional Sanskrit semantics or philosophy.
- Occasional Memorization: Due to the finite corpus size, may occasionally reproduce specific training verses.
- Fixed Context: Generation quality may decline for very long sequences beyond 512 tokens.
- Sandhi Imperfection: While the Unigram tokenizer is morphologically aware, complex Sandhi rules are not always perfectly applied.
- Not Scripture: Generated text is not authentic Vedic or Epic scripture and must never be used for ritual, recitation, or canonical scholarly interpretation.
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{sanskritgpt-itihasa-2026,
  author    = {Dhruvil},
  title     = {SanskritGPT-Itihasa: A GPT-2 Transformer for Sanskrit Epic Verse Generation},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Dhruvil8/SanskritGPT-Itihasa},
  note      = {Trained from scratch on Mahabharata and Ramayana Devanagari texts.}
}
```
## License
This project is released under the MIT License. The underlying Sanskrit texts are in the public domain.
This model is an AI-assisted computational experiment. It reflects the language patterns of the corpus, not the wisdom of the tradition. For authentic scriptural guidance, consult qualified scholars and traditional sources.