NanoChat Telugu 560M Supervised Fine-Tuning v1 (Akhsara)
NanoChat Telugu 560M Supervised Fine-Tuning v1 (Akhsara) is the final supervised fine-tuned version of the NanoChat Telugu model, optimized for diverse Telugu language tasks including conversational AI, text summarization, paraphrasing, content generation, and more. This model represents the culmination of the training pipeline, combining instruction-following capabilities with task-specific optimizations.
Model Details
Model Description
- Model Type: Causal Language Model (Decoder-only Transformer)
- Language: Telugu (te)
- Parameters: ~560M
- Checkpoint Type: Supervised Fine-Tuned (SFT)
- Base Model: viswamaicoe/nanochat-telugu-560M-mid-v1-akshara
- Architecture:
- Layers: 20
- Model Dimension: 1,280
- Attention Heads: 10
- Vocabulary Size: 65,536 tokens
- Maximum Sequence Length: 2,048 tokens
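The architecture numbers above are consistent with the ~560M parameter figure. A back-of-envelope check, assuming a standard decoder block (Q/K/V/O attention projections, 4x MLP expansion) and untied input/output embeddings (the card does not spell out the block internals, so these are assumptions):

```python
# Rough parameter count from the architecture table above.
# Assumes 4 attention projections, 4x MLP expansion, and untied
# input/output embeddings -- assumptions, not confirmed by the card.
d_model = 1280
n_layers = 20
vocab = 65536

attn = 4 * d_model * d_model           # Q, K, V, O projections
mlp = 2 * d_model * (4 * d_model)      # up- and down-projection
per_layer = attn + mlp

embeddings = vocab * d_model           # input embedding table
lm_head = vocab * d_model              # untied output head

total = n_layers * per_layer + embeddings + lm_head
print(f"{total / 1e6:.0f}M")           # prints 561M, consistent with ~560M
```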
Tokenizer
The model uses a custom Telugu-specific BPE (Byte Pair Encoding) tokenizer trained on Telugu text. This tokenizer achieves 70-82% reduction in token count compared to general-purpose tokenizers like GPT-2 and GPT-4, making it highly efficient for Telugu text processing.
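The saving comes from how Telugu is encoded: Telugu codepoints take 3 bytes each in UTF-8, so a byte-level tokenizer with no Telugu merges can emit up to one token per byte. A minimal stdlib-only sketch of the worst case; the token counts in the comparison are hypothetical (the 70-82% figure in this card is measured, not derived here):

```python
# Telugu codepoints (U+0C00-U+0C7F) take 3 bytes each in UTF-8, so a
# byte-level tokenizer without Telugu merges can fragment text badly.
text = "నమస్కారం"  # "namaskaram" (greetings)

chars = len(text)                        # 8 codepoints
utf8_bytes = len(text.encode("utf-8"))   # 24 bytes -> up to 24 byte-level tokens

# Hypothetical comparison: if a general-purpose tokenizer needed ~24
# tokens here and the Telugu BPE needed ~5, the reduction would be:
general_tokens, telugu_tokens = 24, 5
reduction = 1 - telugu_tokens / general_tokens
print(chars, utf8_bytes, f"{reduction:.0%}")  # prints: 8 24 79%
```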
Training Data
This supervised fine-tuned checkpoint was trained on a comprehensive combination of Telugu datasets:
- SmolTalk: Instruction-following conversations for general conversational capabilities
- Chat Personalised: Telugu identity conversations for establishing consistent bot personality
- IndicParaphrase: Telugu paraphrase generation for text variation
- Custom News: AYA Telugu news articles for article and title generation tasks
- Custom Summary (TeSum): IIIT Hyderabad summarization dataset for abstractive summarization
- Custom Summary (XL-Sum): Multilingual summarization dataset (Telugu portion) for diverse summarization capabilities
- Custom Story: Telugu tiny stories (200k examples) for creative writing and story generation
Capabilities
The model excels at various Telugu language tasks:
- Conversational AI: Natural dialogue and instruction-following with consistent personality
- Text Summarization: Abstractive summarization of Telugu articles and documents
- Paraphrasing: Text rephrasing and variation generation
- Content Generation: News articles, stories, and creative writing in Telugu
- Question Answering: Answering questions in Telugu with context understanding
- Multi-task Understanding: Unified model capable of handling diverse Telugu language tasks
How to Use
Basic Text Generation
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "viswamaicoe/nanochat-telugu-560M-sft-v1-akhsara"
device = "cpu"  # or torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float32,
).to(device)

# Example: Generate text
text = "హలో, మీరు ఎలా ఉన్నారు?"
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
Chat Completion
For proper conversational interactions, use the chat template with stopping criteria to handle special tokens correctly:
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
import torch

model_id = "viswamaicoe/nanochat-telugu-560M-sft-v1-akhsara"
device = "cpu"  # or torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float32,
).to(device)

class AssistantEndStoppingCriteria(StoppingCriteria):
    def __init__(self, stop_token_ids):
        self.stop_token_ids = stop_token_ids

    def __call__(self, input_ids, scores, **kwargs):
        # Stop as soon as the last generated token is a stopping token
        return input_ids[0, -1].item() in self.stop_token_ids

# Get stopping token IDs
assistant_end_token_id = tokenizer.encode("<|assistant_end|>", add_special_tokens=False)[0]
bos_token_id = tokenizer.bos_token_id if tokenizer.bos_token_id is not None else tokenizer.encode("<|bos|>", add_special_tokens=False)[0]
stopping_criteria = StoppingCriteriaList([
    AssistantEndStoppingCriteria([assistant_end_token_id, bos_token_id])
])

# Chat completion example
conversation = [
    {"role": "user", "content": "నమస్కారం"},
]
inputs = tokenizer.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(device)

# Filter out token_type_ids and other keys generate() does not accept
model_inputs = {
    "input_ids": inputs["input_ids"],
}
if "attention_mask" in inputs:
    model_inputs["attention_mask"] = inputs["attention_mask"]

print(f"Formatted prompt: {tokenizer.decode(model_inputs['input_ids'][0])}")

with torch.no_grad():
    outputs = model.generate(
        **model_inputs,
        max_new_tokens=64,
        do_sample=False,
        stopping_criteria=stopping_criteria,
    )

generated_tokens = outputs[0, model_inputs["input_ids"].shape[1]:]
print(f"Generated: {tokenizer.decode(generated_tokens)}")
Using Embeddings
You can extract sentence-level embeddings from the model for various downstream tasks like semantic search, text classification, or similarity computation:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "viswamaicoe/nanochat-telugu-560M-sft-v1-akhsara"
device = "cpu"  # or torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float32,
).to(device)

def get_sentence_embeddings(text, pooling_strategy="mean"):
    """Extract sentence-level embeddings using pooling."""
    inputs = tokenizer(text, return_tensors="pt")
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)
    with torch.no_grad():
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
    # Get last hidden state: (batch, seq_len, hidden_dim)
    last_hidden_state = outputs.hidden_states[-1]
    if pooling_strategy == "mean":
        # Mean pooling (excluding padding tokens)
        masked_embeddings = last_hidden_state * attention_mask.unsqueeze(-1)
        sentence_embedding = masked_embeddings.sum(dim=1) / attention_mask.sum(dim=1, keepdim=True)
    elif pooling_strategy == "max":
        # Max pooling (padding positions masked to -inf)
        masked_embeddings = last_hidden_state.masked_fill(~attention_mask.unsqueeze(-1).bool(), float("-inf"))
        sentence_embedding = masked_embeddings.max(dim=1)[0]
    else:
        raise ValueError(f"Unknown pooling strategy: {pooling_strategy}")
    return sentence_embedding

# Example usage
text = "హలో, మీరు ఎలా ఉన్నారు?"
sentence_embedding = get_sentence_embeddings(text, pooling_strategy="mean")
print(f"Sentence embedding shape: {sentence_embedding.shape}")
# Sentence embedding shape: torch.Size([1, 1280])
Pooling Strategies:
- mean: Averages all token embeddings (excluding padding tokens) - recommended for general use
- max: Takes the maximum value across each dimension - useful for capturing salient features
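The two strategies can be sketched on a stand-in tensor without loading the model. The shapes match the model's 1,280-dim hidden states, but the values here are random and the sequence is illustrative:

```python
import torch

torch.manual_seed(0)
# Stand-in for last_hidden_state: batch=1, seq_len=4, hidden=1280,
# with the final position treated as padding.
hidden = torch.randn(1, 4, 1280)
mask = torch.tensor([[1, 1, 1, 0]])

# Mean pooling: zero out padded positions, divide by the real token count
mean_pooled = (hidden * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)

# Max pooling: set padded positions to -inf so they never win the max
max_pooled = hidden.masked_fill(~mask.unsqueeze(-1).bool(), float("-inf")).max(dim=1).values

print(mean_pooled.shape, max_pooled.shape)  # torch.Size([1, 1280]) for both

# The padded position has no effect: this equals the mean of the 3 real steps
assert torch.allclose(mean_pooled, hidden[:, :3].mean(dim=1))
```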
Use Cases
- Telugu conversational AI applications and chatbots
- Telugu text summarization systems for articles and documents
- Telugu content generation tools for news, stories, and creative writing
- Telugu language understanding and question-answering systems
- Semantic search and similarity matching for Telugu text
- Telugu text paraphrasing and variation tools
- Multi-task Telugu language processing applications
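For the semantic-search use case, embeddings from get_sentence_embeddings can be ranked by cosine similarity. A minimal sketch on stand-in vectors (in practice the (1, 1280) tensors would come from the "Using Embeddings" snippet above):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-ins for sentence embeddings; replace with real model outputs.
query = torch.randn(1, 1280)   # embedding of the search query
docs = torch.randn(3, 1280)    # embeddings of 3 candidate documents

# Cosine similarity between the query and every document, shape (3,)
scores = F.cosine_similarity(query, docs, dim=-1)
best = scores.argmax().item()  # index of the most similar document
print(scores.shape, best)
```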
Limitations
- The model is primarily trained on Telugu text and may not perform well on other languages
- As with all language models, outputs should be reviewed for accuracy and appropriateness
- The model may generate text that requires fact-checking for factual claims
- Performance may vary across different domains and specialized tasks
- The model may reflect biases present in the training data
Ethical Use and Disclaimer
Important: This model is intended for research and legitimate use cases only. Users are responsible for ensuring their use of this model complies with applicable laws and ethical guidelines.
Prohibited Uses
This model should NOT be used for:
- Generating harmful, offensive, or discriminatory content
- Creating misleading or false information
- Impersonating individuals or entities
- Generating content that violates privacy or confidentiality
- Any illegal activities or purposes that violate human rights
- Automated decision-making in critical domains (healthcare, legal, financial) without human oversight
Responsible Use Guidelines
- Fact-Checking: Always verify factual claims generated by the model, especially for important decisions
- Bias Awareness: The model may reflect biases present in its training data. Be aware of potential biases in outputs
- Human Oversight: Use human review for sensitive applications and important decisions
- Transparency: Disclose when content is AI-generated, especially in public-facing applications
- Privacy: Do not input sensitive personal information or confidential data
- Context Appropriateness: Ensure generated content is appropriate for the intended context and audience
No Warranty
This model is provided "as-is" without any warranties. The model developers and maintainers are not responsible for any misuse, damages, or consequences arising from the use of this model. Users assume full responsibility for their use of this model.
Citation
If you use this model, please cite:
@misc{nanochat-telugu-560m-sft-v1-akhsara,
  title={NanoChat Telugu 560M Supervised Fine-Tuning v1 (Akhsara): A Fine-Tuned Telugu Language Model},
  author={Viswam AI COE},
  year={2025},
  howpublished={\url{https://huggingface.co/viswamaicoe/nanochat-telugu-560M-sft-v1-akhsara}}
}
Related Models
- Base Model: viswamaicoe/nanochat-telugu-560M - Pre-trained base model
- Mid-Training Checkpoint: viswamaicoe/nanochat-telugu-560M-mid-v1-akshara - Intermediate checkpoint used as base for this model
Contact
For questions or issues, please contact the model maintainers.