# Qwen3.6-35B-A3B-TQ3_4S
GGUF quantization of Qwen/Qwen3.6-35B-A3B using TQ3_4S with mixed-precision MoE compression — 2-bit experts, 4-bit attention.
## Files

| File | Description |
|---|---|
| Qwen3.6-35B-A3B-TQ3_4S.gguf | Main model (12.4 GiB, 3.07 BPW) |
| mmproj-BF16.gguf | Multimodal projector (BF16) |
## Quantization

MoE experts tolerate aggressive compression because only 8 of 256 are active per token. This quantization exploits that asymmetry:
| Component | Quant | Rationale |
|---|---|---|
| Expert MLP gate/up | Q2_K | 98% of params, MoE-tolerant |
| Expert MLP down | Q3_K | Write-back sensitivity |
| Attention Q/K/V/O | TQ3_4S | WHT-protected |
| Embeddings + output | Q6_K | Quality anchor |
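As a rough illustration of how the mix in the table lands near the reported 3.07 BPW, here is a back-of-the-envelope blend. The per-component parameter fractions below are illustrative assumptions, not measured from the GGUF (the table's "98%" presumably covers all expert tensors combined), and the BPW assumed for TQ3_4S is likewise a guess; only the K-quant nominal bit widths (Q2_K ≈ 2.56, Q3_K ≈ 3.44, Q6_K ≈ 6.56) are standard llama.cpp values.

```python
# Blended bits-per-weight estimate from the quant mix above.
# Fractions are ASSUMED for illustration, not taken from the model file.
components = {
    # name: (assumed fraction of total params, nominal bits per weight)
    "expert_gate_up": (0.65, 2.5625),   # Q2_K
    "expert_down":    (0.28, 3.4375),   # Q3_K
    "attention":      (0.05, 3.5),      # TQ3_4S (assumed BPW)
    "embed_output":   (0.02, 6.5625),   # Q6_K
}

blended_bpw = sum(frac * bpw for frac, bpw in components.values())
print(f"blended BPW ~ {blended_bpw:.2f}")
```

The point is simply that because the Q2_K/Q3_K expert tensors dominate the parameter count, the blended average stays close to 3 BPW even with Q6_K embeddings.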
## Runtime Requirement

This model requires the public TurboQuant runtime fork:
## Recommended Settings (16GB VRAM)

```shell
./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-TQ3_4S.gguf \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja \
  --reasoning off --reasoning-budget 0 --reasoning-format deepseek
```
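Once the server is up, you can talk to it over llama-server's OpenAI-compatible API. A minimal stdlib-only client sketch (assumes the default port 8080; adjust to your `--port`):

```python
# Build a chat request for llama-server's /v1/chat/completions endpoint.
# Standard library only; the actual POST is commented out so this runs
# without a live server.
import json
import urllib.request

payload = {
    "model": "Qwen3.6-35B-A3B-TQ3_4S",
    "messages": [{"role": "user", "content": "Say hello in Spanish."}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(req.full_url)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```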
With vision:

```shell
./build/bin/llama-server \
  -m Qwen3.6-35B-A3B-TQ3_4S.gguf \
  --mmproj mmproj-BF16.gguf \
  -ngl 99 -c 4096 -np 1 \
  -ctk q4_0 -ctv tq3_0 -fa on \
  --jinja --no-mmproj-offload \
  --reasoning off --reasoning-budget 0 --reasoning-format deepseek
```
## Performance (RTX 5060 Ti 16GB)
| Metric | Value |
|---|---|
| PP512 | 1832 tok/s |
| TG128 | 107 tok/s |
| Size | 12.4 GiB |
| BPW | 3.07 |
| ngl | 99 (full GPU) |
Fits entirely in 16GB VRAM — no CPU offload needed.
## Quality

10/10 correct on a 10-question QA sanity check (capital of France, 2+2, Python reverse string, gravity, WW2, primes, boiling point, Shakespeare, Jupiter, hello→Hola).
## Base Model

- Base model: Qwen/Qwen3.6-35B-A3B
- Quantization source: unsloth/Qwen3.6-35B-A3B-GGUF (Q8_0)
## License
Apache 2.0 — same as the base model.
## Tool Call Validation

Tested with `--jinja` under both `--reasoning off` and `--reasoning on --reasoning-budget 2048`:
| Test | reasoning off | reasoning on |
|---|---|---|
| Basic tool call trigger | ✅ | ✅ |
| Tool response → final answer (no loop) | ✅ | ✅ |
| Correct tool selection from multiple | ✅ | ✅ |
| No tool call for simple questions | ✅ | ✅ |
| Multi-step tool use | ✅ | ✅ |
| Nested quote escaping retry (no loop) | ✅ | ✅ |
| Total | 10/10 | 10/10 |
### Recommended settings for tool-use / agentic workflows

```shell
--jinja --reasoning off --reasoning-budget 0 --reasoning-format deepseek
```
Avoid `--presence-penalty` values above 0.5 for tool use: high penalties diversify reasoning tokens but do not improve structured JSON output, and can cause repeated near-identical tool calls in agent loops.
If using `--reasoning on`, ensure your agent framework detects consecutive identical tool calls and breaks after 2-3 retries.
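The repeat-call guard described above can be sketched as follows. This is an illustrative pattern, not code from any particular agent framework; the function names and the `(tool_name, args)` call shape are assumptions.

```python
# Fingerprint each tool call (name + canonicalized args) and abort once the
# same call has been issued MAX_REPEATS times in a row.
import json

MAX_REPEATS = 3

def run_agent(step):
    """step() returns (tool_name, args_dict), or None when the agent is done."""
    last_fp, repeats = None, 0
    history = []
    while True:
        call = step()
        if call is None:
            return history
        name, args = call
        # sort_keys makes the fingerprint stable across arg orderings
        fp = (name, json.dumps(args, sort_keys=True))
        repeats = repeats + 1 if fp == last_fp else 1
        last_fp = fp
        if repeats >= MAX_REPEATS:
            history.append(("aborted", name))
            return history  # break the loop instead of retrying forever
        history.append((name, args))

# Demo: a model stuck re-issuing the same call is cut off on the 3rd attempt.
calls = iter([("search", {"q": "x"})] * 10)
history = run_agent(lambda: next(calls))
print(history)
```

Fingerprinting on canonicalized JSON (rather than raw model output) avoids false repeats when only whitespace or key order changes between calls.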
### Run the tests yourself

```shell
chmod +x test_tool_calls.sh
./test_tool_calls.sh 8085
```