# Ornstein3.6-35B-A3B-GGUF

GGUF quantizations of DJLougen/Ornstein3.6-35B-A3B, a Qwen 3.6 MoE fine-tune (35B total parameters, ~3B active).
## Support This Work

I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded, balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.
## Model info

- Architecture: Qwen3_5MoeForCausalLM (linear and full attention interleaved, Gated Delta Net)
- Parameters: 34.66 B total / ~3 B active (256 experts, 8 active per token)
- Context: 262,144 tokens
- Hidden size / layers: 2048 / 40
- Vocab: 248,320 tokens
All sub-8-bit quants were produced with an importance matrix (imatrix) computed from a mixed-domain multilingual calibration corpus (eaddario/imatrix-calibration, combined_all_medium): 200 chunks × 512 tokens = 102,400 calibration tokens.
## Quant index

Choose a quant that fits in your RAM/VRAM with room to spare for context. For MoE models, quality degrades more sharply at low bit widths than for dense models of similar size; prefer Q4_K_M or higher if you have the memory.
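As a back-of-envelope check against the table below, file size roughly follows total parameters × bits-per-weight / 8. This is only a sketch: real GGUF files run somewhat larger than the headline bits-per-weight suggests, because embeddings and select tensors are stored at higher precision.

```shell
# Rough size estimate for 34.66B params at a given bits-per-weight (bpw).
# Actual files in the table are a bit larger than these figures.
for bpw in 8 6.5 5.5 4.5 3.5 2.6; do
  awk -v b="$bpw" 'BEGIN { printf "%.2f bpw -> %.1f GB\n", b, 34.66e9 * b / 8 / 1e9 }'
done
```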
| File | Bits | Size | imatrix | Notes |
|---|---|---|---|---|
| Ornstein3.6-35B-A3B-Q8_0.gguf | 8 | 36.9 GB | ✗ | Reference, near-lossless |
| Ornstein3.6-35B-A3B-Q6_K.gguf | 6.5 | 28.5 GB | ✓ | Great default for 32 GB+ systems |
| Ornstein3.6-35B-A3B-Q5_K_M.gguf | 5.5 | 24.7 GB | ✓ | Excellent quality/size balance |
| Ornstein3.6-35B-A3B-Q5_K_S.gguf | 5.5 | 24.0 GB | ✓ | Slightly smaller Q5 |
| Ornstein3.6-35B-A3B-Q4_K_M.gguf | 4.5 | 21.2 GB | ✓ | Common default for 24 GB cards |
| Ornstein3.6-35B-A3B-Q4_K_S.gguf | 4.5 | 19.9 GB | ✓ | Smaller Q4 |
| Ornstein3.6-35B-A3B-IQ4_XS.gguf | 4.25 | ~18 GB | ✓ | Smaller than Q4_K_S, comparable quality with imatrix |
| Ornstein3.6-35B-A3B-Q3_K_M.gguf | 3.5 | 16.8 GB | ✓ | Usable; quality below Q4 |
| Ornstein3.6-35B-A3B-Q3_K_S.gguf | 3.5 | 15.2 GB | ✓ | Smaller Q3 |
| Ornstein3.6-35B-A3B-IQ3_M.gguf | 3.3 | ~15 GB | ✓ | Mixed I-quant; beats Q3_K_S at similar size |
| Ornstein3.6-35B-A3B-IQ3_XXS.gguf | 3.0 | ~13 GB | ✓ | Aggressive 3-bit |
| Ornstein3.6-35B-A3B-Q2_K.gguf | 2.6 | 12.9 GB | ✓ | Lowest K-quant; expect degraded quality |
| Ornstein3.6-35B-A3B-IQ2_M.gguf | 2.7 | ~12 GB | ✓ | Aggressive 2-bit I-quant |
| imatrix.dat | – | 192 MB | – | Importance matrix (GGUF format) |
## Usage

### llama.cpp

```bash
# Interactive chat
llama-cli -m Ornstein3.6-35B-A3B-Q4_K_M.gguf -cnv

# Single prompt
llama-cli -m Ornstein3.6-35B-A3B-Q5_K_M.gguf -p "Write a haiku about MoE routing."

# OpenAI-compatible server
llama-server -m Ornstein3.6-35B-A3B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 8192
```
### Other runners

LM Studio, Ollama (via a Modelfile), koboldcpp, and text-generation-webui all load these GGUFs, provided their bundled llama.cpp supports Qwen3_5MoeForCausalLM with Gated Delta Net.
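For the Ollama route, a minimal Modelfile sketch (the local file path, context size, and model name below are illustrative, not prescribed by this repo):

```
# Hypothetical Modelfile: point FROM at the GGUF you downloaded
FROM ./Ornstein3.6-35B-A3B-Q4_K_M.gguf
PARAMETER num_ctx 8192
```

Then register and run it with `ollama create ornstein3.6-35b -f Modelfile` followed by `ollama run ornstein3.6-35b`.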
## Reproducing the quants

```bash
# 1. Convert safetensors -> BF16 GGUF
python llama.cpp/convert_hf_to_gguf.py <model_dir> \
    --outtype bf16 --outfile Ornstein3.6-35B-A3B-BF16.gguf

# 2. Importance matrix
llama-imatrix \
    -m Ornstein3.6-35B-A3B-BF16.gguf \
    -f calibration.txt \
    -o imatrix.dat \
    --chunks 200 -c 512 -b 512 -ngl 99

# 3. Quantize (example)
llama-quantize --imatrix imatrix.dat \
    Ornstein3.6-35B-A3B-BF16.gguf \
    Ornstein3.6-35B-A3B-Q4_K_M.gguf Q4_K_M
```
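The single-quant step above generalizes to the whole set: a sketch that loops `llama-quantize` over every sub-8-bit type in the quant index (assumes the BF16 GGUF and imatrix.dat from the previous steps are in the working directory):

```shell
# Batch-quantize all sub-8-bit variants from the BF16 GGUF (sketch).
# Exit quietly if llama-quantize is not on PATH.
command -v llama-quantize >/dev/null 2>&1 || { echo "llama-quantize not found; skipping"; exit 0; }

for q in Q6_K Q5_K_M Q5_K_S Q4_K_M Q4_K_S IQ4_XS Q3_K_M Q3_K_S IQ3_M IQ3_XXS Q2_K IQ2_M; do
  llama-quantize --imatrix imatrix.dat \
    Ornstein3.6-35B-A3B-BF16.gguf \
    "Ornstein3.6-35B-A3B-${q}.gguf" "$q"
done
```

Q8_0 is left out of the loop because it was produced without the imatrix.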
## License

Apache 2.0, inherited from the Qwen 3.6 base release.
Base model: Qwen/Qwen3.6-35B-A3B