# Gemma 4 31B - GGUF
This repository contains GGUF-format quantized weights for Google's Gemma 4 31B.

These weights were quantized with the current master branch of llama.cpp to ensure full architectural compatibility, including high-speed inference on modern hardware such as NVIDIA Blackwell GPUs (RTX 50-series) and Apple Silicon.
## 📦 Available Quantization Formats
Below are multiple quantization "flavors" to best match your hardware capabilities.
| File Name | Bit-Rate | Size | Target VRAM / RAM | Description |
|---|---|---|---|---|
| `gemma-4-31b-Q8_0.gguf` | 8-bit | ~32.1 GB | 36 GB+ | Highest quality; no noticeable quality loss. |
| `gemma-4-31b-Q6_K.gguf` | 6-bit | ~25.5 GB | 28 GB+ | Near-perfect reasoning retention; fits comfortably on 32 GB GPUs. |
| `gemma-4-31b-Q5_K_M.gguf` | 5-bit | ~22.5 GB | 24 GB+ | High precision; ideal for coding and complex math. |
| `gemma-4-31b-Q4_K_M.gguf` | 4-bit | ~18.2 GB | 20 GB+ | **Recommended.** The sweet spot for 24 GB GPUs. |
Note: Context windows (e.g., 32K or the maximum 256K) will require additional VRAM allocation for the KV Cache.
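To get a feel for how much extra VRAM the KV cache needs, you can estimate it from the model's attention dimensions. The sketch below uses the standard formula (2 tensors per layer, K and V, at fp16); the layer, KV-head, and head-dimension values are **hypothetical placeholders for illustration only** — read the real values from the GGUF metadata, which llama.cpp prints at load time:

```python
def kv_cache_bytes(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size: K and V tensors per layer,
    each of shape [n_kv_heads, ctx_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical dimensions for illustration -- NOT confirmed for Gemma 4 31B.
# Check the values llama.cpp reports (n_layer, n_head_kv, n_embd_head).
size = kv_cache_bytes(ctx_len=32768, n_layers=48, n_kv_heads=8, head_dim=128)
print(f"~{size / 2**30:.1f} GiB of KV cache at 32K context (fp16)")
```

Quantized KV caches (e.g. `--cache-type-k q8_0` in llama.cpp) roughly halve this figure, at a small quality cost.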
## 🚀 How to Use
These GGUF files are designed to be run locally using llama.cpp or any compatible downstream UI (e.g., LM Studio, Ollama, or custom Gradio WebUIs).
### 💻 Command Line (llama.cpp)
To run the recommended Q4_K_M model with full GPU offloading and a 32K context window:
```bash
./llama-cli -m gemma-4-31b-Q4_K_M.gguf \
  -n 2048 -c 32768 -ngl 999 \
  -p "You are an expert AI assistant. Explain quantum entanglement."
```
### 🐍 Python (llama-cpp-python)
```python
from llama_cpp import Llama

# Load the model with flash attention enabled
llm = Llama(
    model_path="./gemma-4-31b-Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=32768,       # 32K context window
    flash_attn=True,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a python script to calculate the Fibonacci sequence."},
    ]
)
print(response["choices"][0]["message"]["content"])
```
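The chat API above applies the model's chat template for you. If you instead pass a raw `-p` prompt to `llama-cli`, you need to format turns yourself. A minimal sketch, **assuming** Gemma 4 keeps the `<start_of_turn>`/`<end_of_turn>` template of earlier Gemma releases (verify against the tokenizer metadata in the GGUF):

```python
def gemma_prompt(user_message, system_message=None):
    # Earlier Gemma models have no separate system role; the system
    # message is folded into the first user turn. Assumed here for Gemma 4.
    content = f"{system_message}\n\n{user_message}" if system_message else user_message
    return (
        f"<start_of_turn>user\n{content}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = gemma_prompt(
    "Explain quantum entanglement.",
    system_message="You are an expert AI assistant.",
)
print(prompt)
```

The resulting string can be passed directly as the `-p` argument, or to `llm(prompt, ...)` for raw completion.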
## ⚖️ License & Acknowledgements
These weights are derivative works of Google's Gemma 4 31B model. They are distributed under the Apache 2.0 License. All credit for the underlying neural architecture and base training data goes to the Google DeepMind team.