# Ornstein3.6-35B-A3B-GGUF

GGUF quantizations of DJLougen/Ornstein3.6-35B-A3B, a Qwen 3.6 MoE fine-tune (35B total parameters, ~3B active).
## Support This Work

I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded, balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.
## Model info

- Architecture: Qwen3_5MoeForCausalLM (linear and full attention interleaved, Gated Delta Net)
- Parameters: 34.66 B total / ~3 B active (256 experts, 8 active per token)
- Context: 262,144 tokens
- Hidden size / layers: 2048 / 40
- Vocab: 248,320 tokens
All sub-8-bit quants were produced with an importance matrix (imatrix) computed from a mixed-domain multilingual calibration corpus (eaddario/imatrix-calibration, combined_all_medium): 200 chunks × 512 tokens = 102,400 calibration tokens.
## Quant index

Choose a quant that fits in your RAM/VRAM with room to spare for context. For MoE models, quality degrades more sharply at low bit widths than for dense models of similar size; prefer Q4_K_M or higher if you have the memory.
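As a back-of-envelope check against the table below, file size roughly follows total parameters × bits-per-weight / 8. This is only a sketch: real GGUF files run somewhat larger than the headline bits-per-weight suggests, because embeddings and select tensors are stored at higher precision.

```shell
# Rough size estimate for 34.66B params at a given bits-per-weight (bpw).
# Actual files in the table are a bit larger than these figures.
for bpw in 8 6.5 5.5 4.5 3.5 2.6; do
  awk -v b="$bpw" 'BEGIN { printf "%.2f bpw -> %.1f GB\n", b, 34.66e9 * b / 8 / 1e9 }'
done
```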
| File | Bits | Size | imatrix | Notes |
|---|---|---|---|---|
| Ornstein3.6-35B-A3B-Q8_0.gguf | 8 | 36.9 GB | ✗ | Reference, near-lossless |
| Ornstein3.6-35B-A3B-Q6_K.gguf | 6.5 | 28.5 GB | ✓ | Great default for 32 GB+ systems |
| Ornstein3.6-35B-A3B-Q5_K_M.gguf | 5.5 | 24.7 GB | ✓ | Excellent quality/size balance |
| Ornstein3.6-35B-A3B-Q5_K_S.gguf | 5.5 | 24.0 GB | ✓ | Slightly smaller Q5 |
| Ornstein3.6-35B-A3B-Q4_K_M.gguf | 4.5 | 21.2 GB | ✓ | Common default for 24 GB cards |
| Ornstein3.6-35B-A3B-Q4_K_S.gguf | 4.5 | 19.9 GB | ✓ | Smaller Q4 |
| Ornstein3.6-35B-A3B-IQ4_XS.gguf | 4.25 | ~18 GB | ✓ | Smaller than Q4_K_S, comparable quality with imatrix |
| Ornstein3.6-35B-A3B-Q3_K_M.gguf | 3.5 | 16.8 GB | ✓ | Usable; quality below Q4 |
| Ornstein3.6-35B-A3B-Q3_K_S.gguf | 3.5 | 15.2 GB | ✓ | Smaller Q3 |
| Ornstein3.6-35B-A3B-IQ3_M.gguf | 3.3 | ~15 GB | ✓ | Mixed I-quant; beats Q3_K_S at similar size |
| Ornstein3.6-35B-A3B-IQ3_XXS.gguf | 3.0 | ~13 GB | ✓ | Aggressive 3-bit |
| Ornstein3.6-35B-A3B-Q2_K.gguf | 2.6 | 12.9 GB | ✓ | Lowest K-quant; expect degraded quality |
| Ornstein3.6-35B-A3B-IQ2_M.gguf | 2.7 | ~12 GB | ✓ | Aggressive 2-bit I-quant |
| imatrix.dat | – | 192 MB | – | Importance matrix (GGUF format) |
## Usage

### llama.cpp

```bash
# Interactive chat
llama-cli -m Ornstein3.6-35B-A3B-Q4_K_M.gguf -cnv

# Single prompt
llama-cli -m Ornstein3.6-35B-A3B-Q5_K_M.gguf -p "Write a haiku about MoE routing."

# OpenAI-compatible server
llama-server -m Ornstein3.6-35B-A3B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 8192
```
### Other runners

LM Studio, Ollama (via a Modelfile), koboldcpp, and text-generation-webui all load these GGUFs, provided their bundled llama.cpp supports Qwen3_5MoeForCausalLM with Gated Delta Net.
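For the Ollama route, a minimal Modelfile sketch (the local file path, context size, and model name below are illustrative, not prescribed by this repo):

```
# Hypothetical Modelfile: point FROM at the GGUF you downloaded
FROM ./Ornstein3.6-35B-A3B-Q4_K_M.gguf
PARAMETER num_ctx 8192
```

Then register and run it with `ollama create ornstein3.6-35b -f Modelfile` followed by `ollama run ornstein3.6-35b`.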
## Reproducing the quants

```bash
# 1. Convert safetensors -> BF16 GGUF
python llama.cpp/convert_hf_to_gguf.py <model_dir> \
    --outtype bf16 --outfile Ornstein3.6-35B-A3B-BF16.gguf

# 2. Importance matrix
llama-imatrix \
    -m Ornstein3.6-35B-A3B-BF16.gguf \
    -f calibration.txt \
    -o imatrix.dat \
    --chunks 200 -c 512 -b 512 -ngl 99

# 3. Quantize (example)
llama-quantize --imatrix imatrix.dat \
    Ornstein3.6-35B-A3B-BF16.gguf \
    Ornstein3.6-35B-A3B-Q4_K_M.gguf Q4_K_M
```
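The single-quant step above generalizes to the whole set: a sketch that loops `llama-quantize` over every sub-8-bit type in the quant index (assumes the BF16 GGUF and imatrix.dat from the previous steps are in the working directory):

```shell
# Batch-quantize all sub-8-bit variants from the BF16 GGUF (sketch).
# Exit quietly if llama-quantize is not on PATH.
command -v llama-quantize >/dev/null 2>&1 || { echo "llama-quantize not found; skipping"; exit 0; }

for q in Q6_K Q5_K_M Q5_K_S Q4_K_M Q4_K_S IQ4_XS Q3_K_M Q3_K_S IQ3_M IQ3_XXS Q2_K IQ2_M; do
  llama-quantize --imatrix imatrix.dat \
    Ornstein3.6-35B-A3B-BF16.gguf \
    "Ornstein3.6-35B-A3B-${q}.gguf" "$q"
done
```

Q8_0 is left out of the loop because it was produced without the imatrix.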
## License

Apache 2.0, inherited from the Qwen 3.6 base release.
Base model: Qwen/Qwen3.6-35B-A3B