This article provides a technical deep dive into Qwen3.5, Alibaba’s advanced large language model built on a hybrid attention architecture that combines Gated DeltaNet linear attention and full attention in a 3:1 layer ratio. It supports an ultra-long context window of up to 262K tokens while achieving much higher inference efficiency than traditional Transformers.

Qwen3.5 is Alibaba’s latest LLM family built around a hybrid attention mechanism: Gated DeltaNet linear attention interleaved with standard full attention in a 3:1 layer ratio. This design drastically reduces compute and memory usage while supporting context windows of up to 262K tokens.
This article explains the motivation behind hybrid attention, how Gated DeltaNet works, how the sparse MoE design fits in, and how to convert and deploy the model with llama.cpp.
Standard full self‑attention has complexity O(L²d), which becomes prohibitively expensive for long sequences. Linear attention approaches achieve O(Ld²) complexity but often lose expressiveness.
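To make the asymptotics concrete, here is a back-of-the-envelope comparison (illustrative only; constant factors are omitted, so only the growth rates are meaningful, and the head dimension is a hypothetical value):

```python
# Rough per-layer cost models (constants dropped; only scaling matters).
def full_attention_cost(L, d):
    # Standard self-attention: attention scores + weighted sum ~ O(L^2 * d)
    return L * L * d

def linear_attention_cost(L, d):
    # Linear attention maintains a d x d state per step ~ O(L * d^2)
    return L * d * d

d = 128  # hypothetical head dimension
for L in (1_024, 32_768, 262_144):
    ratio = full_attention_cost(L, d) / linear_attention_cost(L, d)
    print(f"L={L:>7}: full/linear cost ratio = {ratio:.0f}x")
```

The ratio is simply L/d, so the advantage of linear attention grows with sequence length, which is exactly why long-context models benefit most from it.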
Qwen3.5 addresses this tradeoff by interleaving linear-attention and full-attention layers in a 3:1 ratio: most layers run in linear time, while periodic full attention retains global expressiveness.
The result is near-linear scaling with sequence length at close to full-attention quality.
Gated DeltaNet is a linear attention variant in the state-space-model (SSM) family, built on two key innovations: a data-dependent gate that decays stale memory, and the delta rule, which updates only the part of the state that conflicts with the incoming key-value pair.
The memory state evolves as:
S_t = β_t ⊙ S_{t-1} + Δ_t · (k_t ⊗ v_t)

where β_t is the forget gate applied to the previous state, Δ_t is the time-step scale, and k_t ⊗ v_t is the outer product of the current key and value vectors.
This gives O(1) state memory and O(L) compute.
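The recurrence can be sketched in a few lines of pure Python. This is a toy illustration with hypothetical dimensions, and it uses a scalar gate per step for simplicity (the model applies gating per channel):

```python
def step(S, beta, delta, k, v):
    """One recurrence step: S_t = beta * S_{t-1} + delta * (k outer v).

    S is a d x d state matrix (list of lists); beta decays the old state,
    delta scales the rank-1 update formed by the key/value outer product.
    """
    d = len(k)
    return [[beta * S[i][j] + delta * k[i] * v[j] for j in range(d)]
            for i in range(d)]

def scan(ks, vs, betas, deltas):
    d = len(ks[0])
    S = [[0.0] * d for _ in range(d)]   # O(1) memory: the state never grows
    for k, v, b, dt in zip(ks, vs, betas, deltas):
        S = step(S, b, dt, k, v)        # O(L) total compute over the sequence
    return S
```

Notice that no per-token cache is kept: however long the sequence, only the fixed d x d state survives, which is the source of the O(1) memory claim.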
Each linear layer includes:
- A_log, stored in log space for numerical stability
- dt_proj, for time-step gating
- D_proj, for training stability

A typical 24‑layer configuration uses 18 linear-attention layers and 6 full-attention layers (one every 4 layers).
This keeps ~75% of layers lightweight while preserving full-attention quality.
Example config snippet:
```json
"layer_types": [
  "linear_attention",
  "linear_attention",
  "linear_attention",
  "full_attention",
  ...
],
"full_attention_interval": 4
```
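The repeating pattern in the snippet can be generated from the interval alone. A small sketch (the function name is hypothetical; the layer-type strings and interval follow the config above):

```python
def build_layer_types(num_layers, full_attention_interval=4):
    # Every `full_attention_interval`-th layer is full attention;
    # all remaining layers use Gated DeltaNet linear attention.
    return ["full_attention" if (i + 1) % full_attention_interval == 0
            else "linear_attention"
            for i in range(num_layers)]

layers = build_layer_types(24)
print(layers.count("linear_attention"), layers.count("full_attention"))  # 18 6
```

For 24 layers this yields exactly the 3:1 split discussed above: 18 lightweight layers and 6 full-attention layers.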
Qwen3.5 pairs hybrid attention with a high-sparsity Mixture-of-Experts (MoE) feed-forward design: each token is routed to only a small subset of experts, so only a fraction of the total parameters is active per forward pass.
MoE + hybrid attention = extreme efficiency at scale.
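A schematic of top-k expert routing, the core of any sparse MoE layer (pure Python with hypothetical sizes; a real router is a learned gating network producing these logits):

```python
import math

def route(logits, top_k):
    """Pick the top_k experts by router logit and softmax-normalize
    their weights so the selected experts' contributions sum to 1."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = ranked[:top_k]
    exps = [math.exp(logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# 8 experts, activate 2: only a fraction of parameters runs per token.
print(route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], top_k=2))
```

The sparsity ratio (experts activated / experts total) is what makes MoE cheap at inference time while keeping total capacity large.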
Pure linear models (e.g., some Mamba variants) are fast but can degrade on tasks that demand precise long-range recall, such as retrieval, in-context copying, and exact lookup over long documents.
Qwen3.5’s periodic full attention preserves precision while staying near-linear in cost.
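The memory argument can also be made concrete: a full-attention layer's KV cache grows linearly with context, while a linear-attention layer keeps a fixed-size state. A rough sketch with hypothetical per-layer dimensions (fp16 elements):

```python
def kv_cache_bytes(seq_len, n_kv_heads, head_dim, bytes_per_elem=2):
    # Full attention caches both K and V for every past token.
    return 2 * seq_len * n_kv_heads * head_dim * bytes_per_elem

def linear_state_bytes(n_heads, head_dim, bytes_per_elem=2):
    # Gated DeltaNet keeps one d x d state per head, independent of length.
    return n_heads * head_dim * head_dim * bytes_per_elem

# Hypothetical per-layer numbers at the 262K-token context limit.
seq = 262_144
kv = kv_cache_bytes(seq, n_kv_heads=8, head_dim=128)
st = linear_state_bytes(n_heads=8, head_dim=128)
print(f"full-attention KV cache: {kv / 2**20:.0f} MiB per layer")
print(f"linear-attention state:  {st / 2**20:.2f} MiB per layer")
```

With 3 of every 4 layers holding only the constant-size state, total memory at long context is dominated by the few full-attention layers, which is what makes 262K-token windows practical.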
⚠️ Official llama.cpp does NOT support Qwen3.5. Use the fork by tekintian that adds hybrid attention & Gated DeltaNet support: https://github.com/tekintian/llama.cpp
```shell
python convert_hf_to_gguf.py \
  --model your_qwen3.5_model_dir \
  --outfile qwen3.5-f16.gguf \
  --outtype f16
```

```shell
./llama-cli -m qwen3.5-f16.gguf -p "Your prompt" -c 32768
```
Tags: Qwen3.5, LLM, HybridAttention, GatedDeltaNet, MoE, SparseMoE, LongContext, LLMDeployment, GGUF, llama.cpp, AI, InferenceOptimization
Full Chinese version: https://dev.tekin.cn/blog/qwen3-5-hybrid-attention-gated-deltanet-moe-deployment