Sutra-Instruct-350M-GGUF

This repository contains the official GGUF quantizations of Sutra-Instruct-350M, a 350-million parameter model trained entirely from scratch by Abhiray.

The Sutra (ΰ€Έΰ₯‚ΰ€€ΰ₯ΰ€°) series is designed to be a high-speed, rule-based instruction model. These GGUF versions are optimized for ultra-low latency inference on consumer-grade hardware, including CPUs and mobile devices.


πŸ“ Available Quantizations

| File Name | Quant Method | Size | Description |
|---|---|---|---|
| sutra-fp16.gguf | None (F16) | ~779 MB | Original weights. Best for maximum quality and accuracy. |
| sutra-Q8_0.gguf | Q8_0 | ~417 MB | Standard 8-bit. Near-lossless; RECOMMENDED for most users. |
| sutra-Q4_K_M.gguf | Q4_K_M | ~258 MB | 4-bit medium. Smallest file; fine for quick testing, but with noticeable quality loss. |

πŸš€ Quick Start (Inference)

1. Using llama.cpp

Run the model directly from your terminal using llama-cli:

./llama-cli -m sutra-Q8_0.gguf -p "Instruction: What is the law of gravity?\n\nResponse:" -n 400 --temp 0.55 --repeat-penalty 1.2
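The Instruction/Response template passed to llama-cli above can also be built programmatically. A minimal sketch in Python (the format_prompt helper is hypothetical, but the template string matches the command above):

```python
def format_prompt(instruction: str) -> str:
    """Wrap a user instruction in Sutra's Instruction/Response template,
    the same format passed to llama-cli above."""
    return f"Instruction: {instruction}\n\nResponse:"

print(format_prompt("What is the law of gravity?"))
# Instruction: What is the law of gravity?
#
# Response:
```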

2. Using Ollama (Plug-and-Play)

To use this with Ollama, create a file named Modelfile with the following content:

FROM ./sutra-Q8_0.gguf

TEMPLATE """Instruction: {{ .Prompt }}

Response: """

PARAMETER stop "Instruction:"
PARAMETER stop "\nInstruction:"
PARAMETER temperature 0.55
PARAMETER repeat_penalty 1.2
PARAMETER top_k 50
PARAMETER num_predict 512

Then, initialize and run the model:

ollama create sutra -f Modelfile
ollama run sutra
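Once the model is running, it can also be called over Ollama's local HTTP API. A hedged sketch that only builds the JSON payload for the /api/generate endpoint (the ollama_payload helper is hypothetical; the option names follow Ollama's conventions and mirror the Modelfile above):

```python
import json

def ollama_payload(prompt: str) -> str:
    """Build a JSON request body for Ollama's /api/generate endpoint,
    using the sampling settings recommended in this card."""
    return json.dumps({
        "model": "sutra",
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.55,
            "repeat_penalty": 1.2,
            "top_k": 50,
            "num_predict": 512,
        },
    })

print(ollama_payload("What is the law of gravity?"))
```

The resulting string can be POSTed to http://localhost:11434/api/generate with any HTTP client.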

βš™οΈ Optimized Settings To prevent the model from looping or hallucinating, we strongly recommend these inference parameters:

Temperature: 0.55

Repeat Penalty: 1.2

Top-K: 50

Max Tokens: 512
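For scripting, the settings above can be kept in one place and rendered into llama-cli arguments. A small sketch (the SUTRA_FLAGS dict and as_cli_args helper are hypothetical; the flag names match the llama.cpp invocation shown earlier):

```python
# Recommended inference settings from this card, keyed by llama-cli flag.
SUTRA_FLAGS = {
    "--temp": 0.55,
    "--repeat-penalty": 1.2,
    "--top-k": 50,
    "-n": 512,
}

def as_cli_args(flags: dict) -> list[str]:
    """Flatten the settings dict into an argv fragment for llama-cli."""
    args = []
    for flag, value in flags.items():
        args += [flag, str(value)]
    return args

print(" ".join(as_cli_args(SUTRA_FLAGS)))
# --temp 0.55 --repeat-penalty 1.2 --top-k 50 -n 512
```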

πŸ“Š Model Details

Format: GGUF

Model size: 0.4B params

Architecture: gpt2