🧠 TinyLlama 1.1B Chat – GPTQ Quantized (4-bit)

Repo: kivoai/tinyllama-1.1b-chat-gptq
Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Quantization: GPTQ 4-bit (128g)
Tokenizer: Same as base model (BPE)


πŸš€ Purpose

This model is a 4-bit GPTQ-quantized version of TinyLlama/TinyLlama-1.1B-Chat-v1.0, optimized for lightweight inference and deployment in decentralized GPU mining environments.

It is currently being used as part of the Neural Subnet protocol for text-generation mining.


🧰 Technical Details

  • Quantized With: AutoGPTQ (see the configuration sketch after this list)
  • Quantization Type: 4-bit (group size: 128)
  • Precision: int4
  • Safetensors Format: ✅
  • Chat Template: Included (chat_template.jinja)
  • Max Sequence Length: 2048
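
For reference, the settings above map onto AutoGPTQ's quantization config as follows. This is a minimal sketch of how such a checkpoint could be produced, not the exact script used for this repo; the calibration text and the output directory name are placeholders:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(base)

# Matches the settings listed above: 4-bit weights, group size 128.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

model = AutoGPTQForCausalLM.from_pretrained(base, quantize_config)

# Placeholder calibration sample; a real run would use a larger,
# representative set of texts.
examples = [tokenizer("TinyLlama is a compact 1.1B-parameter chat model.")]
model.quantize(examples)

model.save_quantized("tinyllama-1.1b-chat-gptq", use_safetensors=True)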

πŸͺ„ Example Usage

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Load the 4-bit GPTQ weights directly onto the GPU so the model
# and the tokenized inputs below live on the same device.
model = AutoGPTQForCausalLM.from_quantized(
    "kivoai/tinyllama-1.1b-chat-gptq",
    device="cuda:0",
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained("kivoai/tinyllama-1.1b-chat-gptq")

prompt = "What is the meaning of intelligence?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
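
Since the repo ships a chat template (chat_template.jinja, noted above), chat prompts should be rendered with the standard transformers helper rather than passed as raw text. A minimal sketch, reusing the model and tokenizer from the example above and assuming the template follows the usual messages format:

messages = [
    {"role": "user", "content": "What is the meaning of intelligence?"},
]

# apply_chat_template renders the conversation with the bundled
# chat_template.jinja and appends the assistant turn marker.
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(chat_prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))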