# Gemma 4 31B - GGUF
This repository contains GGUF-format quantized weights for Google's Gemma 4 31B.
The weights were produced with a recent master build of llama.cpp to ensure full architectural compatibility, and they support fast inference on modern hardware, including the Blackwell architecture (RTX 50-series) and recent Apple Silicon.
## 📦 Available Quantization Formats

Several quantization "flavors" are provided to match a range of hardware capabilities.
| File Name | Bits | Size | Target VRAM / RAM | Description |
|---|---|---|---|---|
| `gemma-4-31b-Q8_0.gguf` | 8-bit | ~32.1 GB | 36 GB+ | Highest quality; effectively lossless for most tasks. |
| `gemma-4-31b-Q6_K.gguf` | 6-bit | ~25.5 GB | 28 GB+ | Near-lossless reasoning retention; fits comfortably on 32 GB GPUs. |
| `gemma-4-31b-Q5_K_M.gguf` | 5-bit | ~22.5 GB | 24 GB+ | High precision; well suited to coding and complex math. |
| `gemma-4-31b-Q4_K_M.gguf` | 4-bit | ~18.2 GB | 20 GB+ | **Recommended.** The sweet spot for 24 GB GPUs. |
*Note: Long context windows (e.g., 32K, up to the 256K maximum) require additional VRAM for the KV cache.*
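To estimate that overhead, KV-cache memory scales linearly with context length: 2 (K and V) × layers × KV heads × head dimension × bytes per element × context tokens. The sketch below applies this formula with placeholder hyperparameters; these are illustrative assumptions, not Gemma 4 31B's published configuration (llama.cpp prints the real values at load time).

```python
# Back-of-the-envelope KV-cache size estimate.
# All hyperparameters below are ASSUMED for illustration; replace them
# with the values llama.cpp reports when it loads the model.
n_layers = 48        # assumed transformer layer count
n_kv_heads = 8       # assumed KV heads (grouped-query attention)
head_dim = 128       # assumed per-head dimension
bytes_per_elem = 2   # FP16 cache entries (the llama.cpp default)
n_ctx = 32_768       # 32K context window

# Factor of 2 covers the separate K and V tensors.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx
print(f"KV cache @ {n_ctx} tokens: {kv_bytes / 2**30:.1f} GiB")  # ~6.0 GiB here
```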
## 🚀 How to Use
These GGUF files run locally with llama.cpp or any compatible downstream UI (e.g., LM Studio, Ollama, or a custom Gradio WebUI).
### 💻 Command Line (llama.cpp)
To run the recommended Q4_K_M model with full GPU offloading and a 32K context window:
```bash
./llama-cli -m gemma-4-31b-Q4_K_M.gguf \
  -n 2048 \
  -c 32768 \
  -ngl 999 \
  -p "You are an expert AI assistant. Explain quantum entanglement."
```
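If you would rather expose the model over an OpenAI-compatible HTTP API than chat in the terminal, llama.cpp also ships a server binary. A minimal invocation (the port choice is illustrative) might look like:

```bash
# Serve the model via llama.cpp's built-in OpenAI-compatible server.
# -ngl 999 offloads all layers to the GPU; -c sets the context window.
./llama-server -m gemma-4-31b-Q4_K_M.gguf -c 32768 -ngl 999 --port 8080
```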
### 🐍 Python (llama-cpp-python)
```python
from llama_cpp import Llama

# Load the model with flash attention enabled
llm = Llama(
    model_path="./gemma-4-31b-Q4_K_M.gguf",
    n_gpu_layers=-1,  # Offload all layers to the GPU
    n_ctx=32768,      # 32K context window
    flash_attn=True,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python script to calculate the Fibonacci sequence."},
    ]
)

print(response["choices"][0]["message"]["content"])
```
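For interactive applications, the same chat-completion API supports token streaming. A minimal sketch, reusing the `llm` object from above:

```python
# Stream the response token by token (OpenAI-style chunks).
stream = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantum entanglement in one paragraph."}],
    stream=True,
)
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    # The first chunk carries only the role; content arrives in later chunks.
    if "content" in delta:
        print(delta["content"], end="", flush=True)
print()
```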
## ⚖️ License & Acknowledgements
These weights are derivative works of Google's Gemma 4 31B model. They are distributed under the Apache 2.0 License. All credit for the underlying neural architecture and base training data goes to the Google DeepMind team.