Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4

NVFP4 quantized version of huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated — an abliterated (uncensored) Qwen3.6 MoE with 256 experts, 3B active parameters, and state-of-the-art agentic coding performance.

67 GB → 21.9 GB. Single NVIDIA Blackwell GPU. 168 tok/s. Uncensored.

Why This Model

All the power of Qwen3.6-35B-A3B with abliteration — no refusals for local agent workflows:

  • SWE-bench Verified: 73.4 — outperforms models with 10x its active parameter count
  • Terminal-Bench 2.0: 51.5 — best-in-class agentic coding
  • 256 experts, 3B active — extreme sparsity = extreme speed
  • 262K-1M context — native 262K, extensible to 1 million tokens
  • Abliterated — no refusals, full capability for research and local deployment
  • Multimodal — vision preserved at full BF16 precision
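The sparsity claim above comes from MoE routing: for each token, a gate scores all 256 experts and only the top 8 (plus one shared expert) actually run. A minimal pure-Python sketch of top-k gating, with made-up gate scores (the real router uses a learned linear projection):

```python
import math

def top_k_gate(logits, k=8):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    logits: one gate score per expert for a single token (illustrative only).
    """
    # Indices of the k largest logits: the experts this token is routed to.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over just the selected logits gives the mixing weights.
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    return top, weights

# 256 experts, but only 8 contribute per token -> ~3B of 35B params active.
scores = [((i * 37) % 256) / 256.0 for i in range(256)]  # fake gate scores
experts, weights = top_k_gate(scores, k=8)
print(len(experts), round(sum(weights), 6))  # → 8 1.0
```

Because only the selected experts' weights are touched per token, compute per token scales with the 3B active parameters, not the 35B total.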

Key Specs

| Spec | Value |
|---|---|
| Base model | huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated |
| Architecture | Qwen3.5 MoE — 35B total, 3B active, 256 experts (8 routed + 1 shared) |
| Quantization | NVFP4 W4A4 (weights FP4, activations FP4, scales FP8) |
| Format | compressed-tensors (native vLLM support) |
| Tool | vllm-project/llm-compressor (main) |
| Calibration | 512 samples, ultrachat_200k, seq_len=2048, moe_calibrate_all_experts=True |
| Size | 21.9 GB |
| Max context | 262,144 tokens (native) |
| Requires | NVIDIA Blackwell GPU (SM 120), vLLM nightly (cu130) |

Quickstart

vLLM

vllm serve Lna-Lab/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4 \
    --max-model-len 32768 \
    --reasoning-parser qwen3 \
    --kv-cache-dtype fp8

With tool calling (agentic)

vllm serve Lna-Lab/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4 \
    --max-model-len 32768 \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-cache-dtype fp8
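With --enable-auto-tool-choice, the server accepts OpenAI-style tool definitions in the request body, which vLLM's qwen3_coder parser maps onto the model's native tool-call format. A sketch of building such a request; the get_weather tool and all its fields are hypothetical examples, not part of this model:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # example name, not a real API
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": "Lna-Lab/Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
print(json.dumps(payload)[:40])
```

POST this payload to /v1/chat/completions; when the model decides to call the tool, the response carries a tool_calls entry instead of plain content.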

Docker

docker run --gpus '"device=0"' -p 8016:8016 \
    -v /path/to/model:/models/current:ro \
    --shm-size 16gb \
    vllm/vllm-openai:cu130-nightly \
    vllm serve /models/current --port 8016 --max-model-len 32768 \
    --reasoning-parser qwen3 --kv-cache-dtype fp8
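Once either server is up, any OpenAI-compatible client works. A minimal stdlib sketch, assuming the Docker command above (API at localhost:8016, model served under /models/current); the request object is built but not sent, so it can be inspected offline:

```python
import json
import urllib.request

URL = "http://localhost:8016/v1/chat/completions"  # matches the Docker example

def build_request(prompt: str) -> urllib.request.Request:
    # Standard OpenAI chat-completions body; vLLM serves the model
    # under the path it was launched with.
    body = json.dumps({
        "model": "/models/current",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    return urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )

req = build_request("Explain the CAP theorem in two sentences.")
print(req.get_method())  # → POST (inferred because data is set)
# urllib.request.urlopen(req)  # uncomment with the server running
```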

Benchmark

Single NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM).

| Test | Speed | Tokens | Result |
|---|---|---|---|
| English (CAP theorem) | 160 tok/s | 256 | PASS |
| Code (async scheduler) | 161 tok/s | 512 | PASS |
| Math (Bayes' theorem) | 161 tok/s | 512 | PASS |
| Japanese (quantum computing) | 159 tok/s | 256 | PASS |
| Container burst (×3) | 168 tok/s | 512 | PASS — stable |

Sustained: ~168 tok/s (single GPU, container).

Quantization Details

Recipe

recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)

Calibration

  • Dataset: HuggingFaceH4/ultrachat_200k (train_sft split)
  • Samples: 512
  • Max sequence length: 2048
  • moe_calibrate_all_experts=True — ensures all 256 experts receive calibration data

Reproduction

from transformers import Qwen3_5MoeForConditionalGeneration, AutoProcessor, AutoTokenizer
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated"

model = Qwen3_5MoeForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

recipe = QuantizationModifier(
    targets="Linear", scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.shuffle(seed=42)

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}
ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=2048,
                     truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)

oneshot(model=model, dataset=ds, recipe=recipe,
        max_seq_length=2048, num_calibration_samples=512,
        moe_calibrate_all_experts=True)

model.save_pretrained("Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4", save_compressed=True)
processor.save_pretrained("Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4")
tokenizer.save_pretrained("Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4")

Environment

| Package | Version |
|---|---|
| torch | 2.11.0+cu130 |
| transformers | 5.5.4 |
| llmcompressor | 0.1.dev5 (main) |
| compressed-tensors | 0.15.1a20260414 |
| CUDA | 13.0 |

Requirements

  • GPU: NVIDIA Blackwell (SM 120)
  • VRAM: ~22 GB minimum
  • Software: vLLM nightly (cu130)
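NVFP4 kernels need Blackwell, which reports CUDA compute capability 12.0 (SM 120). A quick preflight sketch; the helper names are ours, and the live torch call is only attempted when torch and CUDA are present:

```python
def is_blackwell(capability: tuple[int, int]) -> bool:
    """SM 120 reports as compute capability (12, 0) or later."""
    major, _minor = capability
    return major >= 12

def check_gpu() -> bool:
    try:
        import torch  # optional: only needed for the live check
        if torch.cuda.is_available():
            return is_blackwell(torch.cuda.get_device_capability(0))
    except ImportError:
        pass
    return False  # no torch / no CUDA: NVFP4 will not run

print(is_blackwell((12, 0)), is_blackwell((9, 0)))  # → True False
```

Hopper (9.x) and Ampere (8.x) fail this check, consistent with the note below that NVFP4 will not run on those generations.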

Notes

  • Abliterated (uncensored). Use responsibly.
  • Multimodal (vision) preserved in BF16.
  • Gated DeltaNet + Attention hybrid architecture.
  • NVFP4 is Blackwell-specific. Will not work on Ampere/Hopper.
  • Use --kv-cache-dtype fp8 for ~2x KV-cache capacity with negligible quality impact.
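The fp8 KV note is simple arithmetic: cache bytes per token scale linearly with element width, so fp8 (1 byte) halves fp16's (2 bytes) footprint. A sketch with placeholder dimensions, not this model's actual layer or head counts (the hybrid Gated DeltaNet layers cache less than pure attention would):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Keys + values: 2 tensors per layer, each kv_heads * head_dim elements."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Placeholder dims for illustration only (not the model's real config).
fp16 = kv_bytes_per_token(layers=48, kv_heads=4, head_dim=128, bytes_per_elem=2)
fp8 = kv_bytes_per_token(layers=48, kv_heads=4, head_dim=128, bytes_per_elem=1)
print(fp16 // fp8)  # → 2: fp8 fits twice as many tokens in the same VRAM
```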

Credits

Support the Base Model Author
