# Sutra-Instruct-350M-GGUF
This repository contains the official GGUF quantizations of Sutra-Instruct-350M, a 350-million parameter model trained entirely from scratch by Abhiray.
The Sutra (सूत्र) series is designed to be a high-speed, rule-based instruction model. These GGUF versions are optimized for ultra-low latency inference on consumer-grade hardware, including CPUs and mobile devices.
## 📦 Available Quantizations
| File Name | Quant Method | Size | Description |
|---|---|---|---|
| sutra-fp16.gguf | None (F16) | ~779 MB | Original weights. Best for maximum quality and accuracy. |
| sutra-Q8_0.gguf | Q8_0 | ~417 MB | Standard 8-bit. Near-lossless; RECOMMENDED for most users. |
| sutra-Q4_K_M.gguf | Q4_K_M | ~258 MB | 4-bit Medium. Smallest option; fine for quick testing, but the quality loss makes it less recommended for regular use. |
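As a rough sanity check, the file sizes above track the bits stored per weight. A back-of-the-envelope sketch (it ignores GGUF metadata, block scales, and mixed-precision layers, so the real files run somewhat larger than these estimates):

```python
def estimate_size_mb(n_params: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: parameters x bits per weight, in megabytes.
    Ignores metadata, tokenizer data, and per-block quantization scales."""
    return n_params * bits_per_weight / 8 / 1e6

N = 350e6  # parameter count implied by the model name

print(estimate_size_mb(N, 16))   # F16    -> 700.0 MB (actual file: ~779 MB)
print(estimate_size_mb(N, 8))    # Q8_0   -> 350.0 MB (actual file: ~417 MB)
print(estimate_size_mb(N, 4.5))  # Q4_K_M stores roughly 4-5 bits/weight
```

The gap between estimate and actual file size is the per-block scale factors and embedded metadata that GGUF stores alongside the raw weights.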
## 🚀 Quick Start (Inference)
### 1. Using llama.cpp
Run the model directly from your terminal using llama-cli:
```shell
./llama-cli -m sutra-Q8_0.gguf \
  -p "Instruction: What is the law of gravity?\n\nResponse:" \
  -n 400 --temp 0.55 --repeat-penalty 1.2
```
### 2. Using Ollama (Plug-and-Play)
To use this with Ollama, create a file named `Modelfile` with the following content:
```
FROM ./sutra-Q8_0.gguf

TEMPLATE """Instruction: {{ .Prompt }}

Response: """

PARAMETER stop "Instruction:"
PARAMETER stop "\nInstruction:"
PARAMETER temperature 0.55
PARAMETER repeat_penalty 1.2
PARAMETER top_k 50
PARAMETER num_predict 512
```
Then, initialize and run the model:
```shell
ollama create sutra -f Modelfile
ollama run sutra
```
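Once `ollama run sutra` works, the model is also reachable over Ollama's local REST API (default `http://localhost:11434`). A minimal sketch of the request body, assuming the model was created under the name `sutra` as above; the `options` mirror the Modelfile parameters:

```python
import json
import urllib.request

def build_generate_request(prompt: str) -> dict:
    """JSON body for Ollama's POST /api/generate endpoint."""
    return {
        "model": "sutra",
        "prompt": prompt,
        "stream": False,
        "options": {  # mirrors the Modelfile PARAMETERs
            "temperature": 0.55,
            "repeat_penalty": 1.2,
            "top_k": 50,
            "num_predict": 512,
        },
    }

payload = build_generate_request("What is the law of gravity?")
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Sending the request requires a running Ollama server:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```

The prompt template is applied server-side by Ollama, so the raw instruction text is all the client needs to send.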
## ⚙️ Optimized Settings

To prevent the model from looping or hallucinating, we strongly recommend these inference parameters:
- Temperature: 0.55
- Repeat Penalty: 1.2
- Top-K: 50
- Max Tokens: 512
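For use outside llama-cli or Ollama (e.g. via a llama.cpp binding), a small helper that applies the same prompt template and bundles the parameters above; the function and constant names here are illustrative, not part of any official API:

```python
# Recommended sampling parameters from this card
SAMPLING = {
    "temperature": 0.55,
    "repeat_penalty": 1.2,
    "top_k": 50,
    "max_tokens": 512,
}

def format_prompt(instruction: str) -> str:
    """Wrap user text in the Instruction/Response template the model expects;
    generation should be stopped when the model emits 'Instruction:' again."""
    return f"Instruction: {instruction}\n\nResponse:"

print(format_prompt("What is the law of gravity?"))
```

Whatever runtime you use, pass `"Instruction:"` as a stop sequence so the model does not continue into a new self-generated turn.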
**Base model:** Abiray/Sutra-Instruct-350M