Saanvi-C0-3B πŸ€–βš‘

Apache 2.0 License Β· Python 3.8+ Β· Hugging Face

A production-ready LLM designed to enhance user expression and improve contextual accuracy
Powered by RAG-based technology β€’ 4-bit quantized β€’ Flash Attention 2 β€’ bfloat16 β€’ 2K context


πŸš€ Features

| Feature | Benefit |
|---|---|
| ⚑ Flash Attention 2 | 2.7x faster inference |
| 🧠 4-bit Quantization | 6.2GB VRAM usage |
| 🎯 Instruction-Tuned | Better task performance |
| πŸ”₯ RAG-Enhanced | Contextual precision |

What sets it apart?
Saanvi-C0-3B can be used as a pre-processing step before Retrieval-Augmented Generation (RAG): it helps refine user intent and sharpen contextual matching, improving the precision of responses when paired with a RAG pipeline. Thanks to 4-bit quantization, it also runs efficiently on low-end GPUs with as little as 6.2GB of VRAM, making it accessible on a wide range of hardware (see the loading sketch below).
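
Quantized loading is not part of the Quick Start below, which uses plain bfloat16. As a rough illustration, here is a minimal sketch of loading the model with 4-bit weights and Flash Attention 2; it assumes the bitsandbytes and flash-attn packages are installed, and the exact settings are assumptions rather than an official recipe.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "riple-saanvi-lab/Saanvi-C0-3B"

# Assumed settings: NF4 4-bit weights with bfloat16 compute (illustrative, not an official recipe).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)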


⚑ Quick Start

import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

def parse_args():
    """
    Parse command-line arguments for the chat application.

    Returns:
        argparse.Namespace: Parsed arguments including model path and generation parameters.
    """
    parser = argparse.ArgumentParser(description="Streaming Terminal Chat")
    parser.add_argument("--model_path", type=str, default="riple-saanvi-lab/Saanvi-C1-3B",
                        help="Path to the pre-trained model")
    parser.add_argument("--max_length", type=int, default=512,
                        help="Maximum length for generated responses")
    parser.add_argument("--do_sample", type=bool, default=True,
                        help="Whether to use sampling during generation")
    return parser.parse_args()

def load_model_and_tokenizer(model_path: str):
    """
    Load the model and tokenizer from the specified path.

    Args:
        model_path (str): Path to the pre-trained model.

    Returns:
        tuple: A tuple containing the loaded model (AutoModelForCausalLM) and tokenizer (AutoTokenizer).

    Raises:
        SystemExit: If loading fails, exits with an error message.
    """
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")
        return model, tokenizer
    except Exception as e:
        print(f"Error loading model or tokenizer: {e}")
        exit(1)

def chat_loop(model: AutoModelForCausalLM, tokenizer: AutoTokenizer, max_length: int, do_sample: bool):
    """
    Run the interactive chat loop with streaming responses.

    Args:
        model (AutoModelForCausalLM): The pre-trained language model.
        tokenizer (AutoTokenizer): The tokenizer associated with the model.
        max_length (int): Maximum length of the generated responses.
        do_sample (bool): Whether to use sampling during generation.
    """
    print("πŸ’¬ Streaming Terminal Chat - Type 'exit' to quit")
    while True:
        user_input = input("\nπŸ‘€ You: ").strip()
        if user_input.lower() == "exit":
            print("πŸ‘‹ Exiting chat...")
            break

        # Ensure inputs are on the same device as the model
        device = next(model.parameters()).device
        inputs = tokenizer(user_input, return_tensors="pt").to(device)

        # Generate and stream the response
        print("πŸ€– AI: ", end="", flush=True)
        streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
        _ = model.generate(**inputs, max_length=max_length, do_sample=do_sample, streamer=streamer)
        print()  # Add a newline after the response

def main():
    """
    Main entry point for the chat application.
    """
    args = parse_args()
    model, tokenizer = load_model_and_tokenizer(args.model_path)
    chat_loop(model, tokenizer, args.max_length, args.do_sample)

if __name__ == "__main__":
    main()
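
Assuming the Quick Start snippet is saved as chat.py (a hypothetical filename, any name works), it can be launched from a terminal like this:

python chat.py --model_path riple-saanvi-lab/Saanvi-C0-3B --max_length 512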

πŸ“¦ Installation

# Install core dependencies with CUDA 11+ support (accelerate is required for device_map="auto")
pip install torch transformers accelerate
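
The 4-bit quantization and Flash Attention 2 paths shown earlier rely on extra packages that the base install above does not pull in. A typical setup (assumed, and dependent on your CUDA toolchain) looks like:

# Optional: 4-bit quantization (bitsandbytes) and Flash Attention 2 (flash-attn)
pip install bitsandbytes
pip install flash-attn --no-build-isolation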

πŸ“Š Benchmarks

A100-40GB Performance

| Batch Size | Throughput | Latency | VRAM Usage |
|---|---|---|---|
| 1 | 42 tok/sec | 85ms | 6.2GB |
| 8 | 218 tok/sec | 430ms | 10.8GB |
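
For reference, here is a minimal sketch of how single-batch throughput might be measured; the prompt and token count are illustrative assumptions, not the methodology behind the numbers above.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "riple-saanvi-lab/Saanvi-C0-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Illustrative prompt; a real benchmark would average over many prompts and warm-up runs.
inputs = tokenizer("Explain retrieval-augmented generation in one paragraph.", return_tensors="pt").to(model.device)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tok/sec")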

Low-End GPU Compatibility
With its 4-bit quantization, Saanvi-C0-3B runs smoothly on GPUs with limited VRAM (e.g., NVIDIA GTX 1660 Ti or similar with 6GB), maintaining reasonable performance for single-batch inference.


πŸ“œ License

Licensed under the Apache 2.0 License. See the LICENSE file for details.


πŸ’‘ Pro Tip: For optimal performance on high-end GPUs, pair with torch.compile() and CUDA graphs. On low-end GPUs, stick to smaller batch sizes for best results!
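
As a rough illustration of that tip, a minimal sketch of enabling torch.compile (assuming PyTorch 2.x; "reduce-overhead" mode uses CUDA graphs to cut per-step launch overhead):

import torch

# Reuses the `model` object loaded in the Quick Start above (assumption).
model = torch.compile(model, mode="reduce-overhead")

# Subsequent model.generate(...) calls run through the compiled forward pass after warm-up.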
