# Gemma 4 E2B Heretic — Uncensored | MLX 4-bit | Apple Silicon | 3.34 GB
Gemma 4 E2B abliterated via Heretic ARA — 95% refusal removal with low KL divergence from the base model. One of the smallest uncensored multimodal models for Apple Silicon. 3.34 GB.
p-e-w/gemma-4-E2B-it-heretic-ara converted to MLX 4-bit (affine, group_size=64) for native Apple Silicon inference.
This is a decensored (abliterated) variant of google/gemma-4-E2B-it, processed with Heretic v1.2.0 using the Arbitrary-Rank Ablation (ARA) method with row-norm preservation — then quantized to 4-bit MLX for fast, private, on-device inference on M-series Macs.
⚡ 3.34 GB — fits comfortably in the unified memory of any M-series Mac.
🖤 Runs fully offline. No API calls. No filters.
## Model Details
| Property | Value |
|---|---|
| Base model | google/gemma-4-E2B-it |
| Abliteration | Heretic v1.2.0 via ARA (p-e-w/gemma-4-E2B-it-heretic-ara) |
| Architecture | Gemma4ForConditionalGeneration |
| Parameters | ~2B active |
| Modalities | Text · Vision |
| Quantization | 4-bit affine, group_size=64 |
| File size | 3.34 GB (down from ~9 GB bf16) |
| Context window | 131,072 tokens |
| Vocab size | 262,144 |
| Hidden size | 1,536 |
| Layers | 35 (28× sliding attention + 7× full attention) |
| Attention heads | 8 (KV heads: 1) |
| Sliding window | 512 |
| Vision encoder | 768 hidden · 16 layers · patch 16px |
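The "4-bit affine, group_size=64" row can be illustrated with a minimal NumPy sketch of the scheme (not the actual MLX kernel): each group of 64 weights stores a float scale and offset, and the weights themselves become 4-bit integer codes.

```python
import numpy as np

def affine_quantize(w, group_size=64, bits=4):
    """Group-wise affine quantization: per group, store a float scale
    and offset plus integer codes in [0, 2^bits - 1]."""
    qmax = 2**bits - 1                       # 15 for 4-bit
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / qmax
    scale[scale == 0] = 1.0                  # guard flat groups
    q = np.round((groups - lo) / scale).astype(np.uint8)
    return q, scale, lo

def affine_dequantize(q, scale, lo):
    return (q.astype(np.float32) * scale + lo).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(256).astype(np.float32)
q, scale, lo = affine_quantize(w)
w_hat = affine_dequantize(q, scale, lo)
# Rounding error is bounded by half the per-group scale
print(float(np.abs(w - w_hat).max()))
```

Because the scale and offset are stored per 64-weight group rather than per tensor, outlier weights only degrade precision within their own group.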
## Abliteration Performance (from base model)
| Metric | This model | Original gemma-4-E2B-it |
|---|---|---|
| KL divergence | 0.1522 | 0 (by definition) |
| Refusals | 5 / 100 | 98 / 100 |
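The KL figure measures how far the ablated model's next-token distribution drifts from the base model's on harmless prompts; lower means behavior outside refusals is better preserved. A toy computation of the metric (the probability vectors below are illustrative, not real model outputs):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) between two discrete next-token distributions."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    mask = p > 0                              # terms with p=0 contribute 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

base    = np.array([0.70, 0.20, 0.05, 0.05])  # base model next-token probs
ablated = np.array([0.65, 0.22, 0.07, 0.06])  # ablated model, slightly shifted
print(kl_divergence(base, base))              # 0.0 by definition
print(kl_divergence(base, ablated))           # small positive drift
```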
Abliteration parameters used:
| Parameter | Value |
|---|---|
| start_layer_index | 16 |
| end_layer_index | 32 |
| preserve_good_behavior_weight | 0.1887 |
| steer_bad_behavior_weight | 0.0001 |
| overcorrect_relative_weight | 0.6737 |
| neighbor_count | 4 |
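Heretic's ARA internals are not reproduced here, but the core idea of directional ablation with row-norm preservation can be sketched: project a learned "refusal direction" out of a weight matrix, then rescale each row back to its original norm. The rank-1 form and all names below are illustrative assumptions, not Heretic's actual code.

```python
import numpy as np

def ablate_direction(W, v, weight=1.0):
    """Remove each row's component along unit direction v, then restore
    the row's original L2 norm (row-norm preservation)."""
    v = v / np.linalg.norm(v)
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W_abl = W - weight * np.outer(W @ v, v)        # rank-1 projection removal
    new_norms = np.linalg.norm(W_abl, axis=1, keepdims=True)
    return W_abl * (norms / np.maximum(new_norms, 1e-8))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
v = rng.standard_normal(16)
W2 = ablate_direction(W, v)
# Rows now have zero component along v but keep their original magnitude
print(float(np.abs(W2 @ (v / np.linalg.norm(v))).max()))
```

The norm restoration matters because downstream layers are calibrated to the original activation scale; ablation without it shifts the whole residual stream.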
## Performance (Apple Silicon)
At 3.34 GB this is the fastest model in the RavenX collection — runs on any M-series chip including M1 MacBook Air with RAM to spare.
| Chip | Tok/sec (est) |
|---|---|
| M4 Max 128GB | ~90–120 tok/s |
| M3 Pro 36GB | ~50–70 tok/s |
| M1 MacBook Air 16GB | ~25–40 tok/s |
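The figures above are estimates; you can measure your own machine with a small timing wrapper. The helper below is a generic sketch — swap the stand-in generator for the mlx_lm `generate` call shown in the Quickstart.

```python
import time

def tokens_per_second(generate_fn, n_tokens):
    """Time one generation call and return throughput in tokens/sec."""
    start = time.perf_counter()
    generate_fn(n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stand-in generator for illustration; replace with e.g.
#   lambda n: generate(model, tokenizer, prompt=prompt, max_tokens=n)
fake_generate = lambda n: time.sleep(0.01)
tps = tokens_per_second(fake_generate, 256)
print(f"{tps:.0f} tok/s")
```

For honest numbers, run a short warm-up generation first so model load and Metal shader compilation are not counted.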
## Quickstart

### Install

```shell
pip install mlx-lm
```

### Text generation
```python
from mlx_lm import load, generate

model, tokenizer = load("deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit")
messages = [{"role": "user", "content": "Your prompt here."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```
### Vision (image + text)
```python
from mlx_lm import load, generate

model, tokenizer = load("deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/image.jpg"},
        {"type": "text", "text": "What do you see?"}
    ]
}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
```
### CLI

```shell
mlx_lm.generate \
  --model deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit \
  --prompt "Tell me something interesting." \
  --max-tokens 256
```

### OpenAI-compatible server

```shell
mlx_lm.server \
  --model deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit \
  --port 8080
```

### Ollama

```shell
ollama run hf.co/deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit
```
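Once `mlx_lm.server` is running, any OpenAI-compatible client can talk to it over `/v1/chat/completions`. A minimal stdlib request builder (the endpoint path follows the OpenAI chat-completions convention; no third-party client needed):

```python
import json
import urllib.request

def chat_request(base_url, model, user_message, max_tokens=256):
    """Build an OpenAI-style chat-completions request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request(
    "http://localhost:8080",
    "deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit",
    "Hello!",
)
# With the server running: urllib.request.urlopen(req) returns the JSON completion
print(req.full_url)
```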
## Architecture Notes
Gemma 4 E2B uses the same hybrid sliding/full attention design as the larger models:
- 28× sliding attention layers (window=512) — efficient local context
- 7× full attention layers — global coherence at regular intervals
The 2B size makes it ideal for: rapid prototyping, creative writing, local agents, edge deployment, and anything that needs fast uncensored responses without cloud latency.
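The hybrid layout above can be pictured as a repeating pattern — runs of sliding-window layers punctuated by a full-attention layer — and the sliding mask itself is just a band-limited causal mask. The 4:1 interleaving below is an illustrative assumption; the spec only fixes the totals at 28 and 7.

```python
import numpy as np

# 28 sliding + 7 full = 35 layers; a 4:1 interleave matches those totals
# (the exact ordering inside the real model is an assumption here)
layer_pattern = (["sliding"] * 4 + ["full"]) * 7

def sliding_causal_mask(seq_len, window):
    """True where token i may attend to token j: causal, and within
    `window` positions back (window=512 in this model)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_causal_mask(seq_len=8, window=4)
print(mask.astype(int))  # lower-triangular band of width 4
```

The sliding layers keep per-layer KV memory bounded by the window, while the periodic full-attention layers let information propagate across the whole 131K context.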
## 💻 Gemini CLI — Coding Agent + Tool Orchestration
We use RavenX AI's Gemini CLI fork as the coding agent and tool orchestration layer on top of these models. This is what makes the tool-calling capability real in production.
Gemini CLI gives you a full agentic loop in the terminal — Google Search grounding, file read/write, shell execution, web fetching, and MCP server support — all wired to a 1M token context window.
```shell
# Install
npm install -g @google/gemini-cli

# Run as a coding agent against this model (via local mlx_lm server)
mlx_lm.server --model deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit --port 8080 &
gemini --baseUrl http://localhost:8080

# Or use directly against the Gemini API (free tier: 60 req/min)
gemini
```
### What Gemini CLI + these models unlock together
| Capability | How |
|---|---|
| Code generation | Gemini CLI reads your codebase; model reasons with `<think>` tags |
| Tool calling | Native `<\|tool>` tokens → Gemini CLI executes shell/file/web tools |
| Long context | 1M ctx in CLI + TurboQuant 4.6x KV compression = very long sessions |
| MCP servers | Connect any MCP server — databases, APIs, custom tools |
| Search grounding | Google Search built in — model gets live data |
```shell
# Real example: code review with tool calling enabled
gemini --baseUrl http://localhost:8080 \
  "Review all Python files in ./src, find potential bugs, and suggest fixes"

# Gemini CLI will: read files → call tools → model reasons → produce structured output
```
→ DeadByDawn101/gemini-cli on GitHub — Apache 2.0, free tier, MCP-compatible
## ⚡ TurboQuant-MLX — 4.6x KV Cache Compression
Pair this model with TurboQuant-MLX — RavenX AI's Apple Silicon KV cache compression. Run 4.6x longer contexts with near-zero accuracy loss by compressing the KV cache using PolarQuant + QJL residuals.
```python
from turboquant_mlx.mlx_kvcache import TurboQuantKVCache
import mlx_lm.models.cache as cache_module

# Patch mlx-lm to use TurboQuant compression
cache_module.make_prompt_cache = lambda model, **kw: [
    TurboQuantKVCache() for _ in range(len(model.layers))
]

# Now load and run as normal — context is compressed automatically
from mlx_vlm import load, generate
model, processor = load("deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit")
```
| Without TurboQuant | With TurboQuant |
|---|---|
| 8K context @ 12 GB | 36K context @ ~12 GB |
| KV cache grows linearly | KV cache stays compressed |
→ TurboQuant-MLX on GitHub · Release v2.0
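The context gain in the table follows directly from the compression ratio: at a fixed memory budget, KV entries that are 4.6x smaller mean ~4.6x more tokens fit. A back-of-envelope sketch using this model's dimensions (head_dim = 1536 / 8 = 192, 1 KV head; treating all 35 layers as full-attention is a simplifying assumption, since 28 of them cap at a 512-token window):

```python
def kv_cache_bytes(seq_len, n_layers=35, n_kv_heads=1, head_dim=192,
                   bytes_per_elem=2):
    """Uncompressed KV cache size: K and V tensors per layer, fp16 elements."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

base_ctx = 8_192
ratio = 4.6
print(f"{kv_cache_bytes(base_ctx) / 1e6:.0f} MB of KV at {base_ctx} tokens")
print(f"same memory holds ~{int(base_ctx * ratio):,} tokens compressed")
```

With a single KV head (GQA) and sliding-window layers, the uncompressed cache is already small; compression mainly pays off at very long contexts.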
## 🧠 Opus Reasoning + Claude Code LoRA
⚠️ The Opus Reasoning + Claude Code LoRA is NOT compatible with this model.

The LoRA was trained on `gemma-4-E4B` (hidden=2,560, 42 layers). This is `gemma-4-E2B` (hidden=1,536, 35 layers). The architectures differ, so loading the adapter here will produce incorrect results. If you want Opus reasoning + Claude Code behavior, use the E4B model: → deadbydawn101/gemma-4-E4B-mlx-4bit + LoRA
## Conversion Details

- Source: `p-e-w/gemma-4-E2B-it-heretic-ara` (bfloat16, ~9 GB)
- Tool: `mlx_lm.convert` with `--q-bits 4 --q-group-size 64 --q-mode affine`
- Platform: Apple M4 Max 128GB
- Output: 3.34 GB (4-bit weights + bfloat16 embeddings)
## Related Models
| Model | Size | Description |
|---|---|---|
| deadbydawn101/gemma-4-E4B-mlx-4bit | 4.86 GB | Standard 4B — better quality, still fast |
| deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit | 3.34 GB | This model — 2B abliterated, fastest |
## License
Apache 2.0 — subject to Gemma Terms of Use.
Converted by deadbydawn101 · RavenX AI
## TriAttention KV Compression
[2026-04-09] Our MLX port was merged into TriAttention (MIT + NVIDIA) — PR #1 by @DeadByDawn101 (RavenX AI).
Apply 10.7x KV memory reduction and 2.5x throughput on top of this model's built-in 4-bit weight quantization for ~50x combined compression vs full fp16:
```python
from mlx_lm import load
from triattention.mlx import apply_triattention_mlx

model, tokenizer = load("deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit")
apply_triattention_mlx(model, kv_budget=2048)
```
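What a `kv_budget` cap means in practice can be sketched as a bounded buffer: once the cache holds the budgeted number of token entries, old entries are evicted to make room. TriAttention's actual selection policy is more sophisticated; the FIFO eviction below is a placeholder assumption to show the memory invariant.

```python
from collections import deque

class BudgetedKVCache:
    """Toy KV cache that never stores more than `kv_budget` token entries."""
    def __init__(self, kv_budget):
        self.entries = deque(maxlen=kv_budget)  # oldest evicted automatically

    def append(self, key, value):
        self.entries.append((key, value))

    def __len__(self):
        return len(self.entries)

cache = BudgetedKVCache(kv_budget=2048)
for t in range(10_000):                 # simulate a 10k-token generation
    cache.append(f"k{t}", f"v{t}")
print(len(cache))                       # capped at 2048
```

However many tokens are generated, KV memory stays proportional to the budget rather than to the sequence length.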
## RavenX Inference Harness

One-command inference, benchmarking, and local OpenAI-compatible server:

```shell
git clone https://github.com/DeadByDawn101/ravenx-inference-harness
cd ravenx-inference-harness

# Inference
python run.py --model deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit --prompt "Your prompt"

# TriAttention compressed
python run.py --model deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit --triattention --kv-budget 2048

# Local OpenAI-compatible server (works with OpenClaw)
python serve.py --model deadbydawn101/gemma-4-E2B-Heretic-Uncensored-mlx-4bit --triattention
```