Saanvi-C0-3B πŸ€–βš‘

Apache 2.0 License Β· Python 3.8+ Β· Hugging Face

A production-ready LLM designed to enhance user expression and improve contextual accuracy
Powered by RAG-based technology β€’ 4-bit quantized β€’ Flash Attention 2 β€’ bfloat16 β€’ 2K context


πŸš€ Features

| Feature | Benefit |
|---|---|
| ⚑ Flash Attention 2 | 2.7x faster inference |
| 🧠 4-bit Quantization | 6.2GB VRAM usage |
| 🎯 Instruction-Tuned | Better task performance |
| πŸ”₯ RAG-Enhanced | Contextual precision |

What sets it apart?
Saanvi-C0-3B can be used as a pre-processing step before Retrieval-Augmented Generation (RAG): it helps refine user intent and sharpen contextual matching, improving the precision of responses when paired with a RAG pipeline. Thanks to 4-bit quantization, it also runs efficiently on low-end GPUs with as little as 6.2GB of VRAM, making it accessible on a wide range of hardware (see the loading sketch below).
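
Quantized loading is not part of the Quick Start below, which uses plain bfloat16. As a rough illustration, here is a minimal sketch of loading the model with 4-bit weights and Flash Attention 2; it assumes the bitsandbytes and flash-attn packages are installed, and the exact settings are assumptions rather than an official recipe.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "riple-saanvi-lab/Saanvi-C0-3B"

# Assumed settings: NF4 4-bit weights with bfloat16 compute (illustrative, not an official recipe).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)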


⚑ Quick Start

import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

def parse_args():
    """
    Parse command-line arguments for the chat application.

    Returns:
        argparse.Namespace: Parsed arguments including model path and generation parameters.
    """
    parser = argparse.ArgumentParser(description="Streaming Terminal Chat")
    parser.add_argument("--model_path", type=str, default="riple-saanvi-lab/Saanvi-C1-3B",
                        help="Path to the pre-trained model")
    parser.add_argument("--max_length", type=int, default=512,
                        help="Maximum length for generated responses")
    parser.add_argument("--do_sample", type=bool, default=True,
                        help="Whether to use sampling during generation")
    return parser.parse_args()

def load_model_and_tokenizer(model_path: str):
    """
    Load the model and tokenizer from the specified path.

    Args:
        model_path (str): Path to the pre-trained model.

    Returns:
        tuple: A tuple containing the loaded model (AutoModelForCausalLM) and tokenizer (AutoTokenizer).

    Raises:
        SystemExit: If loading fails, exits with an error message.
    """
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")
        return model, tokenizer
    except Exception as e:
        print(f"Error loading model or tokenizer: {e}")
        exit(1)

def chat_loop(model: AutoModelForCausalLM, tokenizer: AutoTokenizer, max_length: int, do_sample: bool):
    """
    Run the interactive chat loop with streaming responses.

    Args:
        model (AutoModelForCausalLM): The pre-trained language model.
        tokenizer (AutoTokenizer): The tokenizer associated with the model.
        max_length (int): Maximum length of the generated responses.
        do_sample (bool): Whether to use sampling during generation.
    """
    print("πŸ’¬ Streaming Terminal Chat - Type 'exit' to quit")
    while True:
        user_input = input("\nπŸ‘€ You: ").strip()
        if user_input.lower() == "exit":
            print("πŸ‘‹ Exiting chat...")
            break

        # Ensure inputs are on the same device as the model
        device = next(model.parameters()).device
        inputs = tokenizer(user_input, return_tensors="pt").to(device)

        # Generate and stream the response
        print("πŸ€– AI: ", end="", flush=True)
        streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
        _ = model.generate(**inputs, max_length=max_length, do_sample=do_sample, streamer=streamer)
        print()  # Add a newline after the response

def main():
    """
    Main entry point for the chat application.
    """
    args = parse_args()
    model, tokenizer = load_model_and_tokenizer(args.model_path)
    chat_loop(model, tokenizer, args.max_length, args.do_sample)

if __name__ == "__main__":
    main()
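
Assuming the Quick Start snippet is saved as chat.py (a hypothetical filename, any name works), it can be launched from a terminal like this:

python chat.py --model_path riple-saanvi-lab/Saanvi-C0-3B --max_length 512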

πŸ“¦ Installation

# Install core dependencies with CUDA 11+ support (accelerate is required for device_map="auto")
pip install torch transformers accelerate
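
The 4-bit quantization and Flash Attention 2 paths shown earlier rely on extra packages that the base install above does not pull in. A typical setup (assumed, and dependent on your CUDA toolchain) looks like:

# Optional: 4-bit quantization (bitsandbytes) and Flash Attention 2 (flash-attn)
pip install bitsandbytes
pip install flash-attn --no-build-isolation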

πŸ“Š Benchmarks

A100-40GB Performance

| Batch Size | Throughput | Latency | VRAM Usage |
|---|---|---|---|
| 1 | 42 tok/sec | 85ms | 6.2GB |
| 8 | 218 tok/sec | 430ms | 10.8GB |
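
For reference, here is a minimal sketch of how single-batch throughput might be measured; the prompt and token count are illustrative assumptions, not the methodology behind the numbers above.

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "riple-saanvi-lab/Saanvi-C0-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Illustrative prompt; a real benchmark would average over many prompts and warm-up runs.
inputs = tokenizer("Explain retrieval-augmented generation in one paragraph.", return_tensors="pt").to(model.device)

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tok/sec")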

Low-End GPU Compatibility
With its 4-bit quantization, Saanvi-C0-3B runs smoothly on GPUs with limited VRAM (e.g., NVIDIA GTX 1660 Ti or similar with 6GB), maintaining reasonable performance for single-batch inference.


πŸ“œ License

Licensed under the Apache 2.0 License. See the LICENSE file for details.


πŸ’‘ Pro Tip: For optimal performance on high-end GPUs, pair with torch.compile() and CUDA graphs. On low-end GPUs, stick to smaller batch sizes for best results!
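
As a rough illustration of that tip, a minimal sketch of enabling torch.compile (assuming PyTorch 2.x; "reduce-overhead" mode uses CUDA graphs to cut per-step launch overhead):

import torch

# Reuses the `model` object loaded in the Quick Start above (assumption).
model = torch.compile(model, mode="reduce-overhead")

# Subsequent model.generate(...) calls run through the compiled forward pass after warm-up.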
