
# Qwen3.6-35B-A3B-TQ3_4S

GGUF quantization of Qwen/Qwen3.6-35B-A3B using TQ3_4S with mixed-precision MoE compression: 2-bit experts, 4-bit attention.

## Files

| File | Description |
|---|---|
| Qwen3.6-35B-A3B-TQ3_4S.gguf | Main model (12.4 GiB, 3.07 BPW) |
| mmproj-BF16.gguf | Multimodal projector (BF16) |

## Quantization

MoE experts tolerate aggressive compression because only 8 of the 256 experts are active per token. This quantization exploits that asymmetry:

| Component | Quant | Rationale |
|---|---|---|
| Expert MLP gate/up | Q2_K | 98% of params, MoE-tolerant |
| Expert MLP down | Q3_K | Write-back sensitivity |
| Attention Q/K/V/O | TQ3_4S | WHT-protected |
| Embeddings + output | Q6_K | Quality anchor |
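To see how this mix lands near the 3.07 BPW figure above, here is a rough back-of-the-envelope estimate. The parameter shares below are hypothetical placeholders (actual tensor sizes vary by model); the Q2_K/Q3_K/Q6_K bits-per-weight values are the llama.cpp k-quant block sizes, and the TQ3_4S figure is an assumption:

```python
# Rough effective-BPW estimate for the quant mix in the table above.
# Shares are illustrative, NOT the model's real tensor proportions.
components = [
    ("expert gate/up (Q2_K)", 0.65, 2.5625),   # llama.cpp Q2_K block size
    ("expert down (Q3_K)",    0.30, 3.4375),   # llama.cpp Q3_K block size
    ("attention (TQ3_4S)",    0.03, 3.5),      # assumed bpw for TQ3_4S
    ("embed + output (Q6_K)", 0.02, 6.5625),   # llama.cpp Q6_K block size
]

# Weighted average over parameter shares gives the effective bits per weight.
bpw = sum(share * bits for _, share, bits in components)
print(f"estimated BPW: {bpw:.2f}")
```

Because the low-bit expert tensors dominate the parameter count, the blended average stays close to 3 BPW even with 6-bit embeddings anchoring quality.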

## Runtime Requirement

This model requires the public TurboQuant runtime fork.

## Recommended Settings (16GB VRAM)

```sh
./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-TQ3_4S.gguf \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja \
  --reasoning off --reasoning-budget 0 --reasoning-format deepseek
```

With vision:

```sh
./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-TQ3_4S.gguf \
  --mmproj mmproj-BF16.gguf \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja --no-mmproj-offload \
  --reasoning off --reasoning-budget 0 --reasoning-format deepseek
```
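Once llama-server is up, it exposes an OpenAI-compatible chat endpoint. A minimal client sketch, assuming the server listens on localhost:8080 (adjust host/port to your invocation):

```python
# Minimal sketch of a chat request against llama-server's
# OpenAI-compatible /v1/chat/completions endpoint.
import json
import urllib.request

payload = {
    "model": "Qwen3.6-35B-A3B-TQ3_4S",
    "messages": [
        {"role": "user", "content": "Translate 'hello' to Spanish."},
    ],
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed host/port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment with a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```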

## Performance (RTX 5060 Ti 16GB)

| Metric | Value |
|---|---|
| PP512 | 1832 tok/s |
| TG128 | 107 tok/s |
| Size | 12.4 GiB |
| BPW | 3.07 |
| ngl | 99 (full GPU) |

Fits entirely in 16GB VRAM; no CPU offload needed.
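The quantized KV cache (`-ctk q4_0 -ctv tq3_0`) is what keeps the 4096-token context cheap on top of the 12.4 GiB of weights. A sketch of the scaling, using hypothetical architecture numbers (layer count, KV heads, and head dim are not published for this quant, and the tq3_0 bits-per-weight is an assumption):

```python
# Illustrative KV-cache size estimate; every architecture number here
# is a placeholder, NOT the real Qwen3.6-35B-A3B configuration.
n_layer, n_kv_head, head_dim, ctx = 48, 4, 128, 4096
q4_0_bpw = 4.5     # llama.cpp q4_0 block size (keys)
tq3_bpw = 3.4375   # assumed bpw for tq3_0 (values)

elems = n_layer * ctx * n_kv_head * head_dim  # per K and per V
k_bytes = elems * q4_0_bpw / 8
v_bytes = elems * tq3_bpw / 8
total_gib = (k_bytes + v_bytes) / 2**30
print(f"KV cache @ {ctx} ctx: {total_gib:.2f} GiB")
```

With numbers in this ballpark the cache stays well under a gigabyte, leaving VRAM headroom for weights, activations, and the mmproj when vision is enabled.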

## Quality

10/10 correct on standard QA benchmark (capital of France, 2+2, Python reverse string, gravity, WW2, primes, boiling point, Shakespeare, Jupiter, hello→Hola).

## Base Model

## License

Apache 2.0, same as the base model.

## Tool Call Validation

Tested with `--jinja` under both `--reasoning off` and `--reasoning on --reasoning-budget 2048`:

| Test | reasoning off | reasoning on |
|---|---|---|
| Basic tool call trigger | ✅ | ✅ |
| Tool response → final answer (no loop) | ✅ | ✅ |
| Correct tool selection from multiple | ✅ | ✅ |
| No tool call for simple questions | ✅ | ✅ |
| Multi-step tool use | ✅ | ✅ |
| Nested quote escaping retry (no loop) | ✅ | ✅ |
| **Total** | 10/10 | 10/10 |

## Recommended settings for tool-use / agentic workflows

```sh
--jinja --reasoning off --reasoning-budget 0 --reasoning-format deepseek
```

Avoid `--presence-penalty` above 0.5 for tool use: high values diversify reasoning tokens but don't improve structured JSON output, and can cause repeated near-identical tool calls in agent loops.

If using `--reasoning on`, ensure your agent framework detects consecutive identical tool calls and breaks after 2-3 retries.
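A minimal sketch of such a guard, with illustrative tool names (the detection logic is generic; integrate it into your own loop however your framework represents tool calls):

```python
# Guard against an agent loop re-issuing the same tool call forever.
def should_abort(history, max_repeats=3):
    """Return True once the same (tool, args) pair appears
    max_repeats times in a row at the end of the call history."""
    if len(history) < max_repeats:
        return False
    tail = history[-max_repeats:]
    return all(call == tail[0] for call in tail)

# Simulated agent loop: the model keeps emitting an identical call.
calls = []
for call in [("get_weather", '{"city":"Paris"}')] * 5:  # hypothetical tool
    calls.append(call)
    if should_abort(calls):
        break  # bail out instead of looping indefinitely

print(len(calls))  # → 3
```

The guard compares only the trailing window, so legitimate repeated calls separated by other tool use (e.g. polling after a write) are not flagged.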

## Run tests yourself

```sh
chmod +x test_tool_calls.sh
./test_tool_calls.sh 8085
```