🧠 DeepGemma-2B-Reasoning

DeepGemma-2B-Reasoning is a fine-tuned version of google/gemma-4-E2B-it, optimized for step-by-step (Chain-of-Thought) reasoning. It was trained via knowledge distillation on datasets generated by Claude Opus, Qwen3.5, and KIMI.

The model generates internal reasoning inside `<thought>` / `<think>` tags before answering, significantly improving the quality of logical and mathematical responses.
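Since the reasoning tags are emitted as plain text, downstream code usually wants to separate them from the final answer. A minimal sketch (assuming the model wraps its chain of thought in a single `<think>` or `<thought>` block, as described above):

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a model response into (reasoning, answer).

    Assumes the chain of thought is wrapped in <think>...</think>
    or <thought>...</thought> tags; returns ("", text) if absent.
    """
    match = re.search(r"<(think|thought)>(.*?)</\1>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reasoning = match.group(2).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

sample = "<think>3 - 1 = 2 apples left; buying 2 more gives 4.</think>\nYou have 4 apples."
reasoning, answer = split_reasoning(sample)
print(answer)  # → You have 4 apples.
```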

🏆 Benchmarks (GSM8K)

| Model | GSM8K (Accuracy) | Improvement |
|---|---|---|
| google/gemma-4-E2B-it (Base) | 30.0% | – |
| DeepGemma-2B-Reasoning (Ours) | 44.0% | +14.0 pts 🚀 |

🛠 Training Details

Training was conducted using Unsloth (QLoRA) on an RTX 4090 (24GB VRAM).

  • Method: QLoRA (4-bit quantization, BF16 adapters)
  • LoRA Parameters: Rank = 48, Alpha = 48
  • Epochs: 2 | Global Steps: 4672 | Learning Rate: 2e-4
  • Final Loss: 1.24
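The setup above can be sketched roughly as follows. This is a hedged reconstruction, not the actual training script: the function names follow the Unsloth/PEFT API, and everything beyond the rank/alpha/4-bit settings listed above is an assumption.

```python
from unsloth import FastVisionModel

# 4-bit base weights with BF16 LoRA adapters (QLoRA), per the list above.
model, tokenizer = FastVisionModel.from_pretrained(
    model_name="google/gemma-4-E2B-it",  # base model named in this card
    max_seq_length=2048,
    load_in_4bit=True,
)

# LoRA rank and alpha from the training details above; dropout is assumed.
model = FastVisionModel.get_peft_model(
    model,
    r=48,
    lora_alpha=48,
    lora_dropout=0.0,
)
```

The learning rate (2e-4) and epoch count (2) would then be passed to the trainer of your choice.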

🗜️ GGUF Version (llama.cpp)

A quantized Q4_K_M GGUF version is available directly in this repo.

  • File: gemma4_e2b-q4_k_m.gguf (~4.7 GB)
  • Quantization: llama.cpp Q4_K_M
  • Merge: full LoRA merge before quantization (Unsloth)

⚡ Performance (RTX 4090, llama.cpp, ngl=999)

| Metric | Value |
|---|---|
| Prompt processing | ~400 tok/s |
| Generation | ~239–262 tok/s |
| Context window | 4096 tokens |
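From these figures you can estimate wall-clock latency per request. A small sketch (the default throughputs are the RTX 4090 numbers above; adjust for your hardware):

```python
def generation_time(new_tokens: int, prompt_tokens: int,
                    gen_tps: float = 250.0, prompt_tps: float = 400.0) -> float:
    """Rough latency estimate in seconds: prompt processing + generation."""
    return prompt_tokens / prompt_tps + new_tokens / gen_tps

# e.g. a 400-token prompt plus 512 generated tokens:
print(round(generation_time(512, 400), 2))  # → 3.05
```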

Usage (llama.cpp)

```bash
./llama-cli -m gemma4_e2b-q4_k_m.gguf \
  -p "<start_of_turn>user\nYour question here<end_of_turn>\n<start_of_turn>model\n" \
  -n 512 -ngl 999 -c 4096
```

💻 Usage (Transformers / Unsloth)

```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    model_name="Zhantas/DeepGemma-2B-Reasoning",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

question = "I had 3 apples. I ate one, and then bought as many as I had left. How many apples do I have now? Reason step by step."
prompt = f"<start_of_turn>user\n{question}<end_of_turn>\n<start_of_turn>model\n"

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,          # required for temperature to take effect
    temperature=0.3,
    repetition_penalty=1.1,
)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

⚠️ Limitations

The model is prone to "overthinking" simple tasks. It is best suited for logic puzzles, coding, and mathematics.
