# Gemma 4 E2B Heretic — Uncensored | MLX 4-bit | Apple Silicon | 3.34 GB
Gemma 4 E2B abliterated via Heretic ARA — 95% refusal removal with low KL divergence from the base model. One of the smallest uncensored multimodal models for Apple Silicon. 3.34 GB.
p-e-w/gemma-4-E2B-it-heretic-ara converted to MLX 4-bit (affine, group_size=64) for native Apple Silicon inference.
This is a decensored (abliterated) variant of google/gemma-4-E2B-it, processed with Heretic v1.2.0 using the Arbitrary-Rank Ablation (ARA) method with row-norm preservation — then quantized to 4-bit MLX for fast, private, on-device inference on M-series Macs.
⚡ 3.34 GB — fits comfortably in the unified memory of any M-series Mac.
🖤 Runs fully offline. No API calls. No filters.
## Model Details
| Property | Value |
|---|---|
| Base model | google/gemma-4-E2B-it |
| Abliteration | Heretic v1.2.0 via ARA (p-e-w/gemma-4-E2B-it-heretic-ara) |
| Architecture | Gemma4ForConditionalGeneration |
| Parameters | ~2B active |
| Modalities | Text · Vision |
| Quantization | 4-bit affine, group_size=64 |
| File size | 3.34 GB (down from ~9 GB bf16) |
| Context window | 131,072 tokens |
| Vocab size | 262,144 |
| Hidden size | 1,536 |
| Layers | 35 (28× sliding attention + 7× full attention) |
| Attention heads | 8 (KV heads: 1) |
| Sliding window | 512 |
| Vision encoder | 768 hidden · 16 layers · patch 16px |
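The "4-bit affine, group_size=64" row can be illustrated with a minimal NumPy sketch of the scheme (not the actual MLX kernel): each group of 64 weights stores a float scale and offset, and the weights themselves become 4-bit integer codes.

```python
import numpy as np

def affine_quantize(w, group_size=64, bits=4):
    """Group-wise affine quantization: per group, store a float scale
    and offset plus integer codes in [0, 2^bits - 1]."""
    qmax = 2**bits - 1                       # 15 for 4-bit
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / qmax
    scale[scale == 0] = 1.0                  # guard flat groups
    q = np.round((groups - lo) / scale).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, lo):
    return (q.astype(np.float32) * scale + lo).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, scale, lo = affine_quantize(w)
w_hat = affine_dequantize(q, scale, lo)
# Rounding error is bounded by half the per-group scale
print(float(np.abs(w - w_hat).max()))
```

Because the scale and offset are stored per 64-weight group rather than per tensor, outlier weights only degrade precision within their own group.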
## Abliteration Performance (from base model)
| Metric | This model | Original gemma-4-E2B-it |
|---|---|---|
| KL divergence | 0.1522 | 0 (by definition) |
| Refusals | 5 / 100 | 98 / 100 |
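The KL figure measures how far the ablated model's next-token distribution drifts from the base model's on harmless prompts; lower means behavior outside refusals is better preserved. A toy computation of the metric (the probability vectors below are illustrative, not real model outputs):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) between two discrete next-token distributions."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    mask = p > 0                              # terms with p=0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

base    = np.array([0.70, 0.20, 0.05, 0.05])  # base model next-token probs
ablated = np.array([0.65, 0.22, 0.07, 0.06])  # ablated model, slightly shifted
print(kl_divergence(base, base))              # 0.0 by definition
print(kl_divergence(base, ablated))           # small positive drift
```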
Abliteration parameters used:
| Parameter | Value |
|---|---|
| start_layer_index | 16 |
| end_layer_index | 32 |
| preserve_good_behavior_weight | 0.1887 |
| steer_bad_behavior_weight | 0.0001 |
| overcorrect_relative_weight | 0.6737 |
| neighbor_count | 4 |
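Heretic's ARA internals are not reproduced here, but the core idea of directional ablation with row-norm preservation can be sketched: project a learned "refusal direction" out of a weight matrix, then rescale each row back to its original norm. The rank-1 form and all names below are illustrative assumptions, not Heretic's actual code.

```python
import numpy as np

def ablate_direction(W, v, weight=1.0):
    """Remove each row's component along unit direction v, then restore
    the row's original L2 norm (row-norm preservation)."""
    v = v / np.linalg.norm(v)
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W_abl = W - weight * np.outer(W @ v, v)        # rank-1 projection removal
    new_norms = np.linalg.norm(W_abl, axis=1, keepdims=True)
    return W_abl * (norms / np.maximum(new_norms, 1e-8))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
v = rng.standard_normal(16)
W2 = ablate_direction(W, v)
# Rows now have zero component along v but keep their original magnitude
print(float(np.abs(W2 @ (v / np.linalg.norm(v))).max()))
```

The norm restoration matters because downstream layers are calibrated to the original activation scale; ablation without it shifts the whole residual stream.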
## Performance (Apple Silicon)
At 3.34 GB this is the fastest model in the RavenX collection — runs on any M-series chip including M1 MacBook Air with RAM to spare.
| Chip | Tok/sec (est) |
|---|---|
| M4 Max 128GB | ~90–120 tok/s |
| M3 Pro 36GB | ~50–70 tok/s |
| M1 MacBook Air 16GB | ~25–40 tok/s |
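The figures above are estimates; you can measure your own machine with a small timing wrapper. The helper below is a generic sketch — swap the stand-in generator for the mlx_lm `generate` call shown in the Quickstart.

```python
import time

def tokens_per_second(generate_fn, n_tokens):
    """Time one generation call and return throughput in tokens/sec."""
    start = time.perf_counter()
    generate_fn(n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stand-in generator for illustration; replace with e.g.
#   lambda n: generate(model, tokenizer, prompt=prompt, max_tokens=n)
fake_generate = lambda n: time.sleep(0.01)
tps = tokens_per_second(fake_generate, 256)
print(f"{tps:.0f} tok/s")
```

For honest numbers, run a short warm-up generation first so model load and Metal shader compilation are not counted.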
## Quickstart

### Install

```shell
pip install mlx-lm
```

### Text generation
```python
from mlx_lm import load, generate

model, tokenizer = load("deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit")
messages = [{"role": "user", "content": "Your prompt here."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```
### Vision (image + text)
```python
from mlx_lm import load, generate

model, tokenizer = load("deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/image.jpg"},
        {"type": "text", "text": "What do you see?"}
    ]
}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
```
### CLI

```shell
mlx_lm.generate \
  --model deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit \
  --prompt "Tell me something interesting." \
  --max-tokens 256
```

### OpenAI-compatible server

```shell
mlx_lm.server \
  --model deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit \
  --port 8080
```

### Ollama

```shell
ollama run hf.co/deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit
```
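Once `mlx_lm.server` is running, any OpenAI-compatible client can talk to it over `/v1/chat/completions`. A minimal stdlib request builder (the endpoint path follows the OpenAI chat-completions convention; no third-party client needed):

```python
import json
import urllib.request

def chat_request(base_url, model, user_message, max_tokens=256):
    """Build an OpenAI-style chat-completions request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request(
    "http://localhost:8080",
    "deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit",
    "Hello!",
)
# With the server running: urllib.request.urlopen(req) returns the JSON completion
print(req.full_url)
```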
## Architecture Notes
Gemma 4 E2B uses the same hybrid sliding/full attention design as the larger models:
- 28× sliding attention layers (window=512) — efficient local context
- 7× full attention layers — global coherence at regular intervals
The 2B size makes it ideal for: rapid prototyping, creative writing, local agents, edge deployment, and anything that needs fast uncensored responses without cloud latency.
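The hybrid layout above can be pictured as a repeating pattern — runs of sliding-window layers punctuated by a full-attention layer — and the sliding mask itself is just a band-limited causal mask. The 4:1 interleaving below is an illustrative assumption; the spec only fixes the totals at 28 and 7.

```python
import numpy as np

# 28 sliding + 7 full = 35 layers; a 4:1 interleave matches those totals
# (the exact ordering inside the real model is an assumption here)
layer_pattern = (["sliding"] * 4 + ["full"]) * 7

def sliding_causal_mask(seq_len, window):
    """True where token i may attend to token j: causal, and within
    `window` positions back (window=512 in this model)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_causal_mask(seq_len=8, window=4)
print(mask.astype(int))  # lower-triangular band of width 4
```

The sliding layers keep per-layer KV memory bounded by the window, while the periodic full-attention layers let information propagate across the whole 131K context.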
## 💻 Gemini CLI — Coding Agent + Tool Orchestration
We use RavenX AI's Gemini CLI fork as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.
Gemini CLI gives you a full agentic loop in the terminal — Google Search grounding, file read/write, shell execution, web fetching, and MCP server support — all wired to a 1M token context window.
```shell
# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against the Gemini API (free tier: 60 req/min)
gemini
```
### What Gemini CLI + these models unlock together
| Capability | How |
|---|---|
| Code generation | Gemini CLI reads your codebase; model reasons with `<think>` tags |
| Tool calling | Native `<\|tool>` tokens → Gemini CLI executes shell/file/web tools |
| Long context | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| MCP servers | Connect any MCP server — databases, APIs, custom tools |
| Search grounding | Google Search built in — model gets live data |
```shell
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
  "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output
```
→ DeadByDawn101/gemini-cli on GitHub — Apache 2.0, free tier, MCP-compatible
## ⚡ TurboQuant-MLX — 4.6x KV Cache Compression
Pair this model with TurboQuant-MLX — RavenX AI's Apple Silicon KV cache compression. Run 4.6x longer contexts with near-zero accuracy loss by compressing the KV cache using PolarQuant + QJL residuals.
```python
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

# Patch mlx-lm to use TurboQuant compression
cache_module.make_prompt_cache = lambda model, **kw: [
    TurboQuantKVCache() for _ in range(len(model.layers))
]

# Now load and run as normal — context is compressed automatically
from mlx_vlm import load, generate
model, processor = load("deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit")
```
| Without TurboQuant | With TurboQuant |
|---|---|
| 8K context @ 12 GB | 36K context @ ~12 GB |
| KV cache grows linearly | KV cache stays compressed |
→ TurboQuant-MLX on GitHub · Release v2.0
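The context gain in the table follows directly from the compression ratio: at a fixed memory budget, KV entries that are 4.6x smaller mean ~4.6x more tokens fit. A back-of-envelope sketch using this model's dimensions (head_dim = 1536 / 8 = 192, 1 KV head; treating all 35 layers as full-attention is a simplifying assumption, since 28 of them cap at a 512-token window):

```python
def kv_cache_bytes(seq_len, n_layers=35, n_kv_heads=1, head_dim=192,
                   bytes_per_elem=2):
    """Uncompressed KV cache size: K and V tensors per layer, fp16 elements."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

base_ctx = 8_192
ratio = 4.6
print(f"{kv_cache_bytes(base_ctx) / 1e6:.0f} MB of KV at {base_ctx} tokens")
print(f"same memory holds ~{int(base_ctx * ratio):,} tokens compressed")
```

With a single KV head (GQA) and sliding-window layers, the uncompressed cache is already small; compression mainly pays off at very long contexts.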
## 🧠 Opus Reasoning + Claude Code LoRA
⚠️ The Opus Reasoning + Claude Code LoRA is NOT compatible with this model.

The LoRA was trained on `gemma-4-E4B` (hidden=2,560, 42 layers). This is `gemma-4-E2B` (hidden=1,536, 35 layers). The architectures differ, so loading the adapter here will produce incorrect results. If you want Opus reasoning + Claude Code behavior, use the E4B model: → deadbydawn101/gemma-4-E4B-mlx-4bit + LoRA
## Conversion Details

- Source: `p-e-w/gemma-4-E2B-it-heretic-ara` (bfloat16, ~9 GB)
- Tool: `mlx_lm.convert` with `--q-bits 4 --q-group-size 64 --q-mode affine`
- Platform: Apple M4 Max 128GB
- Output: 3.34 GB (4-bit weights + bfloat16 embeddings)
## Related Models
| Model | Size | Description |
|---|---|---|
| deadbydawn101/gemma-4-E4B-mlx-4bit | 4.86 GB | Standard 4B — better quality, still fast |
| deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit | 3.34 GB | This model — 2B abliterated, fastest |
## License
Apache 2.0 — subject to Gemma Terms of Use.
Converted by deadbydawn101 · RavenX AI
## TriAttention KV Compression
[2026-04-09] Our MLX port was merged into TriAttention (MIT + NVIDIA) — PR #1 by @DeadByDawn101 (RavenX AI).
Apply 10.7x KV memory reduction and 2.5x throughput on top of this model's built-in 4-bit weight quantization for ~50x combined compression vs full fp16:
```python
from mlx_lm import load
from triattention.mlx import apply_triattention_mlx

model, tokenizer = load("deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)
```
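What a `kv_budget` cap means in practice can be sketched as a bounded buffer: once the cache holds the budgeted number of token entries, old entries are evicted to make room. TriAttention's actual selection policy is more sophisticated; the FIFO eviction below is a placeholder assumption to show the memory invariant.

```python
from collections import deque

class BudgetedKVCache:
    """Toy KV cache that never stores more than `kv_budget` token entries."""
    def __init__(self, kv_budget):
        self.entries = deque(maxlen=kv_budget)  # oldest evicted automatically

    def append(self, key, value):
        self.entries.append((key, value))

    def __len__(self):
        return len(self.entries)

cache = BudgetedKVCache(kv_budget=2048)
for t in range(10_000):                 # simulate a 10k-token generation
    cache.append(f"k{t}", f"v{t}")
print(len(cache))                       # capped at 2048
```

However many tokens are generated, KV memory stays proportional to the budget rather than to the sequence length.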
## RavenX Inference Harness

One-command inference, benchmarking, and local OpenAI-compatible server:

```shell
git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness

# Inference
python run.py --model deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit --prompt "Your prompt"

# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit --triattention --kv-budget 2048

# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit --triattention
```