Ornstein3.6-35B-A3B-GGUF

GGUF quantizations of DJLougen/Ornstein3.6-35B-A3B, a Qwen 3.6 MoE fine-tune (35B total, ~3B active).

Support This Work

I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded, which means balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee; it goes a long way toward keeping these experiments running.

Support on Ko-fi


Model info

  • Architecture: Qwen3_5MoeForCausalLM (linear + full attention interleaved, Gated Delta Net)
  • Parameters: 34.66 B total / ~3 B active (256 experts, 8 active per token)
  • Context: 262,144 tokens
  • Hidden size / layers: 2048 / 40
  • Vocab: 248,320 tokens

All sub-8-bit quants were produced with an importance matrix (imatrix) computed from a mixed-domain multilingual calibration corpus (eaddario/imatrix-calibration → combined_all_medium): 200 chunks × 512 tokens = 102,400 calibration tokens.

Quant index

Choose a quant that fits in your RAM/VRAM with room for context. For MoE models, quality degrades more sharply at low bit widths than for dense models of similar size; prefer Q4_K_M or higher if you have the memory.

| File | Bits | Size | imatrix | Notes |
|------|------|------|---------|-------|
| Ornstein3.6-35B-A3B-Q8_0.gguf | 8 | 36.9 GB | — | Reference, near-lossless |
| Ornstein3.6-35B-A3B-Q6_K.gguf | 6.5 | 28.5 GB | — | Great default for 32 GB+ systems |
| Ornstein3.6-35B-A3B-Q5_K_M.gguf | 5.5 | 24.7 GB | ✓ | Excellent quality/size balance |
| Ornstein3.6-35B-A3B-Q5_K_S.gguf | 5.5 | 24.0 GB | ✓ | Slightly smaller Q5 |
| Ornstein3.6-35B-A3B-Q4_K_M.gguf | 4.5 | 21.2 GB | ✓ | Common 24 GB-card default |
| Ornstein3.6-35B-A3B-Q4_K_S.gguf | 4.5 | 19.9 GB | ✓ | Smaller Q4 |
| Ornstein3.6-35B-A3B-IQ4_XS.gguf | 4.25 | ~18 GB | ✓ | Smaller than Q4_K_S, comparable quality with imatrix |
| Ornstein3.6-35B-A3B-Q3_K_M.gguf | 3.5 | 16.8 GB | ✓ | Usable; quality below Q4 |
| Ornstein3.6-35B-A3B-Q3_K_S.gguf | 3.5 | 15.2 GB | ✓ | Smaller Q3 |
| Ornstein3.6-35B-A3B-IQ3_M.gguf | 3.3 | ~15 GB | ✓ | Mixed I-quant, beats Q3_K_S at similar size |
| Ornstein3.6-35B-A3B-IQ3_XXS.gguf | 3.0 | ~13 GB | ✓ | Aggressive 3-bit |
| Ornstein3.6-35B-A3B-Q2_K.gguf | 2.6 | 12.9 GB | ✓ | Lowest K-quant; expect degraded quality |
| Ornstein3.6-35B-A3B-IQ2_M.gguf | 2.7 | ~12 GB | ✓ | Aggressive I-quant 2-bit |
| imatrix.dat | — | 192 MB | — | Importance matrix (GGUF format) |
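As a back-of-the-envelope check for "fits with room for context", the sketch below adds a worst-case KV-cache estimate (using the layer count and hidden size from Model info above) to the file size. The helper names are hypothetical, and the cache formula pretends every layer is full attention; since the interleaved Gated Delta Net layers need far less state, treat this as an upper bound.

```python
# Upper-bound memory sketch: quant file size + fp16 KV cache + runtime overhead.
# Assumes all 40 layers cache full-attention K/V (an overestimate for this
# architecture, where linear-attention layers store much less).
def kv_cache_gib(context_tokens, layers=40, hidden=2048, bytes_per_elem=2):
    """K and V tensors for every layer at fp16, in GiB."""
    return 2 * layers * hidden * context_tokens * bytes_per_elem / 1024**3

def fits(file_gib, context_tokens, budget_gib, overhead_gib=1.0):
    """True if file + worst-case KV cache + overhead fit the memory budget."""
    return file_gib + kv_cache_gib(context_tokens) + overhead_gib <= budget_gib

print(kv_cache_gib(8192))      # 2.5 GiB at 8k context
print(fits(21.2, 8192, 32.0))  # Q4_K_M on a 32 GB system -> True
```

By this estimate, Q4_K_M at an 8k context needs roughly 21.2 + 2.5 + 1 ≈ 24.7 GB, which is why a 24 GB card usually runs it with partial CPU offload or a shorter context.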

Usage

llama.cpp

```shell
# Interactive chat
llama-cli -m Ornstein3.6-35B-A3B-Q4_K_M.gguf -cnv

# Single prompt
llama-cli -m Ornstein3.6-35B-A3B-Q5_K_M.gguf -p "Write a haiku about MoE routing."

# OpenAI-compatible server
llama-server -m Ornstein3.6-35B-A3B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 8192
```
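Once llama-server is up, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch (the model name and prompt are illustrative; the request line is left commented out since it needs a running server):

```python
# Sketch: query llama-server's OpenAI-compatible chat endpoint with urllib.
import json
from urllib import request

payload = {
    "model": "Ornstein3.6-35B-A3B-Q4_K_M",  # llama-server ignores or echoes this
    "messages": [{"role": "user", "content": "Explain MoE routing in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.7,
}

req = request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# With the server from the snippet above running:
# resp = request.urlopen(req)
# print(json.loads(resp.read())["choices"][0]["message"]["content"])
```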

Other runners

LM Studio, Ollama (via a Modelfile), koboldcpp, and text-generation-webui all load these GGUFs provided their bundled llama.cpp supports Qwen3_5MoeForCausalLM with Gated Delta Net.
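For Ollama specifically, a minimal Modelfile is enough to import a quant. The file path and sampling parameters below are illustrative; the chat template is read from the GGUF metadata:

```
FROM ./Ornstein3.6-35B-A3B-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
```

Then build and run it with `ollama create ornstein -f Modelfile` followed by `ollama run ornstein`.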

Reproducing the quants

```shell
# 1. Convert safetensors → BF16 GGUF
python llama.cpp/convert_hf_to_gguf.py <model_dir> \
    --outtype bf16 --outfile Ornstein3.6-35B-A3B-BF16.gguf

# 2. Importance matrix
llama-imatrix \
    -m Ornstein3.6-35B-A3B-BF16.gguf \
    -f calibration.txt \
    -o imatrix.dat \
    --chunks 200 -c 512 -b 512 -ngl 99

# 3. Quantize (example)
llama-quantize --imatrix imatrix.dat \
    Ornstein3.6-35B-A3B-BF16.gguf \
    Ornstein3.6-35B-A3B-Q4_K_M.gguf Q4_K_M
```
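To produce the whole table rather than one file, step 3 can be scripted. A hypothetical helper that emits one llama-quantize command per quant type, skipping the imatrix for Q8_0 and Q6_K (which the table marks as built without one):

```python
# Hypothetical batch helper: print the step-3 command for every quant type
# in the table above. Q8_0 and Q6_K are quantized without the imatrix.
QUANT_TYPES = [
    "Q8_0", "Q6_K", "Q5_K_M", "Q5_K_S", "Q4_K_M", "Q4_K_S", "IQ4_XS",
    "Q3_K_M", "Q3_K_S", "IQ3_M", "IQ3_XXS", "Q2_K", "IQ2_M",
]
NO_IMATRIX = {"Q8_0", "Q6_K"}

def quantize_cmd(qtype, base="Ornstein3.6-35B-A3B", use_imatrix=True):
    im = "--imatrix imatrix.dat " if use_imatrix else ""
    return f"llama-quantize {im}{base}-BF16.gguf {base}-{qtype}.gguf {qtype}"

for q in QUANT_TYPES:
    print(quantize_cmd(q, use_imatrix=q not in NO_IMATRIX))
```

Piping the output through `sh` (or writing it to a script) runs the full batch sequentially.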

License

Apache 2.0, inherited from the Qwen 3.6 base release.
