Sutra-Instruct-350M-GGUF

This repository contains the official GGUF quantizations of Sutra-Instruct-350M, a 350-million parameter model trained entirely from scratch by Abhiray.

The Sutra (ΰ€Έΰ₯‚ΰ€€ΰ₯ΰ€°) series is designed to be a high-speed, rule-based instruction model. These GGUF versions are optimized for ultra-low latency inference on consumer-grade hardware, including CPUs and mobile devices.


πŸ“ Available Quantizations

| File Name | Quant Method | Size | Description |
|---|---|---|---|
| sutra-fp16.gguf | None (F16) | ~779 MB | Original weights. Best for maximum quality and accuracy. |
| sutra-Q8_0.gguf | Q8_0 | ~417 MB | Standard 8-bit. Near-lossless; RECOMMENDED for most users. |
| sutra-Q4_K_M.gguf | Q4_K_M | ~258 MB | 4-bit medium. Smallest file; fine for quick testing, but with noticeable quality loss. |

πŸš€ Quick Start (Inference)

1. Using llama.cpp

Run the model directly from your terminal using llama-cli:

./llama-cli -m sutra-Q8_0.gguf -p "Instruction: What is the law of gravity?\n\nResponse:" -n 400 --temp 0.55 --repeat-penalty 1.2
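The Instruction/Response template passed to llama-cli above can also be built programmatically. A minimal sketch in Python (the format_prompt helper is hypothetical, but the template string matches the command above):

```python
def format_prompt(instruction: str) -> str:
    """Wrap a user instruction in Sutra's Instruction/Response template,
    the same format passed to llama-cli above."""
    return f"Instruction: {instruction}\n\nResponse:"

print(format_prompt("What is the law of gravity?"))
# Instruction: What is the law of gravity?
#
# Response:
```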

2. Using Ollama (Plug-and-Play)

To use this with Ollama, create a file named Modelfile with the following content:

FROM ./sutra-Q8_0.gguf

TEMPLATE """Instruction: {{ .Prompt }}

Response: """

PARAMETER stop "Instruction:"
PARAMETER stop "\nInstruction:"
PARAMETER temperature 0.55
PARAMETER repeat_penalty 1.2
PARAMETER top_k 50
PARAMETER num_predict 512

Then, initialize and run the model:

ollama create sutra -f Modelfile
ollama run sutra
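Once the model is running, it can also be called over Ollama's local HTTP API. A hedged sketch that only builds the JSON payload for the /api/generate endpoint (the ollama_payload helper is hypothetical; the option names follow Ollama's conventions and mirror the Modelfile above):

```python
import json

def ollama_payload(prompt: str) -> str:
    """Build a JSON request body for Ollama's /api/generate endpoint,
    using the sampling settings recommended in this card."""
    return json.dumps({
        "model": "sutra",
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.55,
            "repeat_penalty": 1.2,
            "top_k": 50,
            "num_predict": 512,
        },
    })

print(ollama_payload("What is the law of gravity?"))
```

The resulting string can be POSTed to http://localhost:11434/api/generate with any HTTP client.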

βš™οΈ Optimized Settings To prevent the model from looping or hallucinating, we strongly recommend these inference parameters:

Temperature: 0.55

Repeat Penalty: 1.2

Top-K: 50

Max Tokens: 512
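For scripting, the settings above can be kept in one place and rendered into llama-cli arguments. A small sketch (the SUTRA_FLAGS dict and as_cli_args helper are hypothetical; the flag names match the llama.cpp invocation shown earlier):

```python
# Recommended inference settings from this card, keyed by llama-cli flag.
SUTRA_FLAGS = {
    "--temp": 0.55,
    "--repeat-penalty": 1.2,
    "--top-k": 50,
    "-n": 512,
}

def as_cli_args(flags: dict) -> list[str]:
    """Flatten the settings dict into an argv fragment for llama-cli."""
    args = []
    for flag, value in flags.items():
        args += [flag, str(value)]
    return args

print(" ".join(as_cli_args(SUTRA_FLAGS)))
# --temp 0.55 --repeat-penalty 1.2 --top-k 50 -n 512
```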

πŸ“Š Model Details

Format: GGUF

Model size: 0.4B params

Architecture: gpt2