# gemma-3-1b-it-Q8_0-GGUF

GGUF Q8_0 quantization of google/gemma-3-1b-it, converted and quantized from scratch using llama.cpp.

## Quantization

| Step | Tool | Input | Output |
|------|------|-------|--------|
| 1 | `convert_hf_to_gguf.py` | BF16 safetensors | F16 GGUF |
| 2 | `llama-quantize` | F16 GGUF | Q8_0 GGUF |
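The two steps above can be reproduced from a llama.cpp checkout roughly as follows; the model directory and output filenames here are illustrative assumptions, not the exact commands used:

```shell
# Step 1: convert the BF16 safetensors checkpoint to an F16 GGUF
# (input path and output names are assumptions)
python convert_hf_to_gguf.py ./gemma-3-1b-it \
    --outtype f16 \
    --outfile gemma-3-1b-it-F16.gguf

# Step 2: quantize the F16 GGUF down to Q8_0
./llama-quantize gemma-3-1b-it-F16.gguf gemma-3-1b-it-Q8_0.gguf Q8_0
```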

## Files

| File | Size | Description |
|------|------|-------------|
| `gemma-3-1b-it-Q8_0.gguf` | 1.07 GB | Q8_0, 8-bit quantization, 8.50 BPW |
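The 8.50 BPW figure follows directly from the Q8_0 block layout: each block of 32 weights stores 32 int8 values plus one FP16 scale. A quick check:

```python
# Q8_0 stores weights in blocks of 32: 32 int8 values plus one fp16 scale
weights_per_block = 32
bits_per_block = weights_per_block * 8 + 16  # int8 payload + fp16 scale
bpw = bits_per_block / weights_per_block
print(bpw)  # -> 8.5
```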

## Benchmark (Google Colab T4)

| Model | Prefill | Throughput | VRAM |
|-------|---------|------------|------|
| BF16 (transformers) | 226 ms | 8.7 tok/s | 3334 MB |
| INT8 (bitsandbytes) | 245 ms | 5.2 tok/s | 1329 MB |
| GGUF Q8_0 (this model) | 83 ms | 107 tok/s | ~1100 MB |
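For reference, the table works out to roughly a 12x generation speedup over the BF16 transformers baseline at about a third of the VRAM:

```python
# Ratios derived from the benchmark table above
bf16_tps, q8_tps = 8.7, 107.0
bf16_vram, q8_vram = 3334, 1100  # MB; the Q8_0 figure is approximate
speedup = q8_tps / bf16_tps
vram_ratio = q8_vram / bf16_vram
print(f"{speedup:.1f}x faster, {vram_ratio:.0%} of BF16 VRAM")
# -> 12.3x faster, 33% of BF16 VRAM
```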

## Usage

```python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="MichaelLowrance/gemma-3-1b-it-Q8_0-GGUF",
    filename="gemma-3-1b-it-Q8_0.gguf",
    n_gpu_layers=-1,
    n_ctx=2048,
)

output = llm(
    "<start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\n",
    max_tokens=100,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```
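The raw prompt string above hard-codes Gemma 3's chat-turn markers. A small helper (hypothetical, not part of llama-cpp-python) keeps that formatting in one place:

```python
def gemma_prompt(user_msg: str) -> str:
    """Wrap a user message in Gemma 3's chat-turn markers,
    leaving the model turn open for generation.
    (Hypothetical helper; reproduces the literal prompt used above.)"""
    return (
        "<start_of_turn>user\n"
        f"{user_msg}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

# Matches the prompt string in the usage example
print(gemma_prompt("Hello!"))
```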

## Notes

- Outputs verified to be 100% identical to bartowski/google_gemma-3-1b-it-GGUF Q8_0.
- F32 tensors (norms, embedding scales) are left in F32, per llama.cpp defaults.
- Built with llama.cpp (latest main branch at the time of quantization).