🧠 TinyLlama 1.1B Chat – GPTQ Quantized (4-bit)

Repo: kivoai/tinyllama-1.1b-chat-gptq
Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Quantization: GPTQ 4-bit (128g)
Tokenizer: Same as base model (BPE)


πŸš€ Purpose

This model is a 4-bit GPTQ-quantized version of TinyLlama/TinyLlama-1.1B-Chat-v1.0, optimized for lightweight inference and deployment in decentralized GPU mining environments.

It is currently being used as part of the Neural Subnet protocol for text-generation mining.


🧰 Technical Details

  • Quantized With: AutoGPTQ (see the configuration sketch after this list)
  • Quantization Type: 4-bit (group size: 128)
  • Precision: int4
  • Safetensors Format: ✅
  • Chat Template: Included (chat_template.jinja)
  • Max Sequence Length: 2048
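
For reference, the settings above map onto AutoGPTQ's quantization config as follows. This is a minimal sketch of how such a checkpoint could be produced, not the exact script used for this repo; the calibration text and the output directory name are placeholders:

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(base)

# Matches the settings listed above: 4-bit weights, group size 128.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

model = AutoGPTQForCausalLM.from_pretrained(base, quantize_config)

# Placeholder calibration sample; a real run would use a larger,
# representative set of texts.
examples = [tokenizer("TinyLlama is a compact 1.1B-parameter chat model.")]
model.quantize(examples)

model.save_quantized("tinyllama-1.1b-chat-gptq", use_safetensors=True)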

πŸͺ„ Example Usage

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Load the 4-bit GPTQ weights directly onto the GPU so the model
# and the tokenized inputs below live on the same device.
model = AutoGPTQForCausalLM.from_quantized(
    "kivoai/tinyllama-1.1b-chat-gptq",
    device="cuda:0",
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained("kivoai/tinyllama-1.1b-chat-gptq")

prompt = "What is the meaning of intelligence?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
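
Since the repo ships a chat template (chat_template.jinja, noted above), chat prompts should be rendered with the standard transformers helper rather than passed as raw text. A minimal sketch, reusing the model and tokenizer from the example above and assuming the template follows the usual messages format:

messages = [
    {"role": "user", "content": "What is the meaning of intelligence?"},
]

# apply_chat_template renders the conversation with the bundled
# chat_template.jinja and appends the assistant turn marker.
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(chat_prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))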