NanoChat Telugu 560M Mid-Training v1 (Akshara)

NanoChat Telugu 560M Mid-Training v1 (Akshara) is an intermediate checkpoint of the NanoChat Telugu model, trained on diverse Telugu conversational and task-specific datasets. This checkpoint bridges the gap between pre-training and supervised fine-tuning: it provides enhanced instruction-following and conversational capabilities while serving as a foundation for further task-specific fine-tuning.

Model Details

Model Description

  • Model Type: Causal Language Model (Decoder-only Transformer)
  • Language: Telugu (te)
  • Parameters: ~560M
  • Checkpoint Type: Mid-training checkpoint
  • Base Model: viswamaicoe/nanochat-telugu-560M
  • Architecture:
    • Layers: 20
    • Model Dimension: 1,280
    • Attention Heads: 10
    • Vocabulary Size: 65,536 tokens
    • Maximum Sequence Length: 2,048 tokens
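
As a sanity check, the parameter count implied by these dimensions can be estimated with a short script. This is a rough sketch: the 4× MLP expansion and the untied embedding/LM-head weights are assumptions, not details confirmed by this card.

```python
# Rough parameter-count estimate from the architecture numbers above.
# Assumptions (not confirmed by the model card): standard transformer
# blocks with a 4x MLP expansion and an LM head untied from the embedding.

d_model = 1280
n_layers = 20
vocab = 65_536

embedding = vocab * d_model            # token embedding matrix
attention = 4 * d_model * d_model      # Q, K, V and output projections
mlp = 2 * 4 * d_model * d_model        # up- and down-projection, 4x expansion
per_layer = attention + mlp
lm_head = vocab * d_model              # assumed untied from the embedding

total = embedding + n_layers * per_layer + lm_head
print(f"~{total / 1e6:.0f}M parameters")  # -> ~561M parameters
```

Under these assumptions the estimate lands close to the stated ~560M, which suggests the listed dimensions are internally consistent.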

Tokenizer

The model uses a custom Telugu-specific BPE (Byte Pair Encoding) tokenizer trained on Telugu text. This tokenizer achieves 70-82% reduction in token count compared to general-purpose tokenizers like GPT-2 and GPT-4, making it highly efficient for Telugu text processing.
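
To quantify this on your own text, count tokens under both tokenizers and compare. The helper below is a small sketch of that arithmetic; the specific counts are illustrative, not measurements from the actual tokenizers.

```python
def token_reduction(custom_count: int, baseline_count: int) -> float:
    """Percentage reduction in token count relative to a baseline tokenizer."""
    return 100 * (1 - custom_count / baseline_count)

# Illustrative numbers only: a Telugu sentence that a general-purpose BPE
# splits into 50 tokens but a Telugu-specific tokenizer covers in 12
# would be a 76% reduction, inside the 70-82% range quoted above.
print(f"{token_reduction(12, 50):.0f}% fewer tokens")  # -> 76% fewer tokens
```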

Training Data

This mid-training checkpoint was trained on a comprehensive mixture of Telugu datasets including:

  • Instruction-Following: SmolTalk dataset for general conversational capabilities
  • News Articles: News article generation and title creation tasks
  • Summarization: TeSum (IIIT Hyderabad) and XL-Sum datasets for abstractive summarization
  • Paraphrasing: IndicParaphrase dataset for text rephrasing and variation
  • Story Generation: Telugu stories dataset for creative writing
  • Personality & Identity: Custom datasets for establishing consistent bot personality and safety guardrails

Capabilities

The model demonstrates improved capabilities in:

  • Conversational AI: Enhanced natural dialogue and instruction-following compared to the base pre-trained model
  • Text Summarization: Abstractive summarization of Telugu articles
  • Paraphrasing: Text rephrasing and variation generation
  • Content Generation: News articles, stories, and creative writing
  • Task-Specific Understanding: Better understanding of user instructions and task requirements

How to Use

Basic Text Generation

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "viswamaicoe/nanochat-telugu-560M-mid-v1-akshara"
device = "cpu"  # or torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float32,
).to(device)

# Example: Generate text
text = "హలో, మీరు ఎలా ఉన్నారు?"  # "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)
    
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Chat Completion

For proper conversational interactions, use the chat template with stopping criteria to handle special tokens correctly:

from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList
import torch

model_id = "viswamaicoe/nanochat-telugu-560M-mid-v1-akshara"
device = "cpu"  # or torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float32,
).to(device)


class AssistantEndStoppingCriteria(StoppingCriteria):
    def __init__(self, stop_token_ids):
        self.stop_token_ids = stop_token_ids
    
    def __call__(self, input_ids, scores, **kwargs):
        # Check if the last token is a stopping token
        return input_ids[0, -1].item() in self.stop_token_ids

# Get stopping token IDs
assistant_end_token_id = tokenizer.encode("<|assistant_end|>", add_special_tokens=False)[0]
bos_token_id = tokenizer.bos_token_id if tokenizer.bos_token_id is not None else tokenizer.encode("<|bos|>", add_special_tokens=False)[0]

stopping_criteria = StoppingCriteriaList([
    AssistantEndStoppingCriteria([assistant_end_token_id, bos_token_id])
])

# Chat completion example
conversation = [
    {"role": "user", "content": "เฐจเฐฎเฐธเฑเฐ•เฐพเฐฐเฐ‚"},
]

inputs = tokenizer.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(device)

# Filter out token_type_ids and other unsupported keys
model_inputs = {
    "input_ids": inputs["input_ids"],
}
if "attention_mask" in inputs:
    model_inputs["attention_mask"] = inputs["attention_mask"]

print(f"Formatted prompt: {tokenizer.decode(model_inputs['input_ids'][0])}")

with torch.no_grad():
    outputs = model.generate(
        **model_inputs,
        max_new_tokens=64,
        do_sample=False,
        stopping_criteria=stopping_criteria,
    )

generated_tokens = outputs[0, model_inputs["input_ids"].shape[1]:]
print(f"Generated: {tokenizer.decode(generated_tokens)}")

Using Embeddings

You can extract sentence-level embeddings from the model for various downstream tasks like semantic search, text classification, or similarity computation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "viswamaicoe/nanochat-telugu-560M-mid-v1-akshara"
device = "cpu"  # torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float32,
).to(device)


def get_sentence_embeddings(text, pooling_strategy="mean"):
    """Extract sentence-level embeddings using pooling"""
    inputs = tokenizer(text, return_tensors="pt")
    input_ids = inputs["input_ids"].to(device)
    attention_mask = inputs["attention_mask"].to(device)
    
    with torch.no_grad():
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
        )
    
    # Get last hidden state
    last_hidden_state = outputs.hidden_states[-1]
    
    if pooling_strategy == "mean":
        # Mean pooling (excluding padding tokens)
        masked_embeddings = last_hidden_state * attention_mask.unsqueeze(-1)
        sentence_embedding = masked_embeddings.sum(dim=1) / attention_mask.sum(dim=1, keepdim=True)
    elif pooling_strategy == "max":
        # Max pooling over non-padding positions
        masked_embeddings = last_hidden_state.masked_fill(~attention_mask.unsqueeze(-1).bool(), float('-inf'))
        sentence_embedding = masked_embeddings.max(dim=1)[0]
    else:
        raise ValueError(f"Unknown pooling strategy: {pooling_strategy}")

    return sentence_embedding

# Example usage
text = "హలో, మీరు ఎలా ఉన్నారు?"  # "Hello, how are you?"
sentence_embedding = get_sentence_embeddings(text, pooling_strategy="mean")
print(f"Sentence embedding shape: {sentence_embedding.shape}")
# Sentence embedding shape: torch.Size([1, 1280])

Pooling Strategies:

  • mean: Averages all token embeddings (excluding padding tokens) - recommended for general use
  • max: Takes the maximum value across each dimension - useful for capturing salient features
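
A common next step after pooling is cosine similarity for semantic search or duplicate detection; the comparison itself needs only torch. A sketch with random vectors standing in for real outputs of get_sentence_embeddings:

```python
import torch
import torch.nn.functional as F

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine similarity between two [1, hidden_dim] sentence embeddings."""
    return F.cosine_similarity(a, b, dim=-1).item()

# Random vectors stand in for real sentence embeddings of shape [1, 1280]
torch.manual_seed(0)
emb_a = torch.randn(1, 1280)
emb_b = torch.randn(1, 1280)

print(f"similarity: {cosine_similarity(emb_a, emb_b):+.3f}")
assert cosine_similarity(emb_a, emb_a) > 0.999  # identical texts score ~1.0
```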

Use Cases

  • Telugu conversational AI applications
  • Telugu text summarization systems
  • Telugu content generation tools
  • Telugu language understanding tasks
  • Semantic search and similarity matching for Telugu text
  • Foundation model for further task-specific fine-tuning

Limitations

  • This is an intermediate checkpoint and may not be as refined as fully fine-tuned models
  • The model is primarily trained on Telugu text and may not perform well on other languages
  • As with all language models, outputs should be reviewed for accuracy and appropriateness
  • The model may generate text that requires fact-checking for factual claims
  • Performance may vary across different task domains

Ethical Use and Disclaimer

Important: This model is intended for research and legitimate use cases only. Users are responsible for ensuring their use of this model complies with applicable laws and ethical guidelines.

Prohibited Uses

This model should NOT be used for:

  • Generating harmful, offensive, or discriminatory content
  • Creating misleading or false information
  • Impersonating individuals or entities
  • Generating content that violates privacy or confidentiality
  • Any illegal activities or purposes that violate human rights
  • Automated decision-making in critical domains (healthcare, legal, financial) without human oversight

Responsible Use Guidelines

  • Fact-Checking: Always verify factual claims generated by the model, especially for important decisions
  • Bias Awareness: The model may reflect biases present in its training data. Be aware of potential biases in outputs
  • Human Oversight: Use human review for sensitive applications and important decisions
  • Transparency: Disclose when content is AI-generated, especially in public-facing applications
  • Privacy: Do not input sensitive personal information or confidential data
  • Context Appropriateness: Ensure generated content is appropriate for the intended context and audience

No Warranty

This model is provided "as-is" without any warranties. The model developers and maintainers are not responsible for any misuse, damages, or consequences arising from the use of this model. Users assume full responsibility for their use of this model.

Citation

If you use this model, please cite:

@misc{nanochat-telugu-560m-mid-v1-akshara,
  title={NanoChat Telugu 560M Mid-Training v1 (Akshara): A Telugu Language Model Checkpoint},
  author={Viswam AI COE},
  year={2025},
  howpublished={\url{https://huggingface.co/viswamaicoe/nanochat-telugu-560M-mid-v1-akshara}}
}

Contact

For questions or issues, please contact the model maintainers.
