
Qwen3.5 Hybrid Attention: Gated DeltaNet + MoE Architecture & Deployment Guide

2026-03-06 · 4 min read

This article provides a technical deep dive into Qwen3.5, Alibaba’s advanced large language model built on a hybrid attention architecture that combines Gated DeltaNet linear attention and full attention in a 3:1 layer ratio. It supports an ultra-long context window of up to 262K tokens while achieving much higher inference efficiency than traditional Transformers.

Abstract

Qwen3.5 is Alibaba’s latest LLM family built around a hybrid attention mechanism that combines Gated DeltaNet linear attention and standard full attention in a 3:1 layer ratio. This design drastically reduces compute and memory usage while supporting up to 262K token context windows.

This article explains:

  • How Gated DeltaNet and hybrid attention work
  • Sparse MoE routing & efficiency gains
  • Practical model conversion to GGUF
  • Deployment using a Qwen3.5‑compatible fork of llama.cpp

1. Background: Why Hybrid Attention?

Standard full self‑attention has complexity O(L²d), which becomes prohibitively expensive for long sequences. Linear attention approaches achieve O(Ld²) complexity but often lose expressiveness.
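To make the asymptotics concrete, here is a back-of-the-envelope comparison (illustrative only: constant factors, heads, and hardware effects are ignored, and d = 128 is an assumed per-head dimension):

```python
# Rough cost comparison between full and linear attention.
# Illustrative arithmetic only; real kernels differ by constant factors.

def full_attention_flops(L: int, d: int) -> int:
    """O(L^2 * d): every token attends to every other token."""
    return L * L * d

def linear_attention_flops(L: int, d: int) -> int:
    """O(L * d^2): a fixed d x d state is updated once per token."""
    return L * d * d

d = 128  # assumed per-head dimension
for L in (4_096, 262_144):
    ratio = full_attention_flops(L, d) / linear_attention_flops(L, d)
    print(f"L={L:>7}: full/linear cost ratio = {ratio:.0f}x")
```

The ratio is simply L/d, so at a 262K context the quadratic term dominates by three orders of magnitude, which is exactly the regime hybrid attention targets.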

Qwen3.5 solves this by:

  • Using Gated DeltaNet for fast, linear-complexity attention
  • Inserting full-attention layers periodically to preserve long-range precision
  • Pairing attention with sparse MoE for conditional compute

The result:

  • Up to 19× higher throughput at 256K context vs. Qwen3-Max
  • Near-linear scaling for long texts
  • Strong performance on reasoning, QA, and code

2. Gated DeltaNet: Core Mechanism

Gated DeltaNet is a linear attention variant that borrows machinery from state space models (SSMs), with two key innovations:

2.1 Gating & Delta Update

The memory state evolves as:

S_t = β_t ⊙ S_{t-1} + Δ_t ⊗ (K_t ⊗ V_t)

  • β_t (gate): controls how much past state is retained
  • Δ_t (delta): enables precise, non-destructive updates
  • No need to cache the full KV history — only a fixed-size state

This gives O(1) state memory and O(L) compute.
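The update rule above can be sketched as a toy recurrence. This is a minimal illustration, not Qwen3.5's actual parameterization: the state is a d_k × d_v matrix, and beta/delta are taken as scalars here for clarity.

```python
import numpy as np

def gated_delta_step(S, k, v, beta, delta):
    """One recurrent step: decay the old state, write a rank-1 update.

    S     : (d_k, d_v) fixed-size memory state
    k, v  : (d_k,), (d_v,) key/value for the current token
    beta  : gate in [0, 1], how much past state survives
    delta : step size for the new key-value association
    """
    return beta * S + delta * np.outer(k, v)

d_k, d_v, L = 4, 4, 16
S = np.zeros((d_k, d_v))
rng = np.random.default_rng(0)
for _ in range(L):
    k = rng.standard_normal(d_k)
    v = rng.standard_normal(d_v)
    S = gated_delta_step(S, k, v, beta=0.9, delta=0.1)

# Memory stays O(1) in sequence length: S is always d_k x d_v,
# no matter how many tokens have been processed.
print(S.shape)  # (4, 4)
```

Note how the loop runs L steps but the state never grows: that fixed-size S is what replaces the ever-growing KV cache of full attention.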

2.2 SSM Components

Each linear layer includes:

  • 1D convolution for local structure
  • State matrix A_log stored in log space for stability
  • dt_proj for time-step gating
  • Skip connection D_proj for training stability
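A rough sketch of how these components might fit together in one linear layer follows. Shapes, initialization, and the exact update are assumptions for illustration, not Qwen3.5's implementation:

```python
import numpy as np

def causal_conv1d(x, w):
    """Depthwise causal 1D conv for local structure. x: (L, d), w: (k, d)."""
    k = w.shape[0]
    padded = np.vstack([np.zeros((k - 1, x.shape[1])), x])
    return np.stack([(padded[t:t + k] * w).sum(axis=0)
                     for t in range(x.shape[0])])

L, d = 8, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((L, d))

A_log = rng.standard_normal(d)                          # state matrix in log space
dt = np.logaddexp(0, x @ rng.standard_normal((d, d)))   # softplus time-step gate
decay = np.exp(-np.exp(A_log) * dt)                     # per-channel decay in (0, 1)
D = rng.standard_normal(d)                              # skip-connection weights

h = causal_conv1d(x, rng.standard_normal((3, d)))
s = np.zeros(d)
out = np.empty_like(x)
for t in range(L):
    s = decay[t] * s + h[t]   # gated recurrent state update
    out[t] = s + D * x[t]     # skip path aids training stability
print(out.shape)  # (8, 4)
```

Keeping A in log space guarantees exp(A_log) > 0, so the decay always lands in (0, 1) and the recurrence cannot blow up.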

3. Qwen3.5 Hybrid Attention Architecture

3.1 Layer Mixing Strategy

A typical 24‑layer configuration uses:

  • 3× linear attention
  • 1× full attention
  • Repeat across the stack

This keeps ~75% of layers lightweight while preserving full-attention quality.

Example config snippet:

"layer_types": [
  "linear_attention",
  "linear_attention",
  "linear_attention",
  "full_attention",
  ...
],
"full_attention_interval": 4

3.2 Sparse MoE Design

Qwen3.5 uses high-sparsity MoE:

  • Total params: up to 397B
  • Activated per token: only 17B
  • Top‑K routing (typically top‑8 from 64 experts)
  • Shared experts for stability

MoE + hybrid attention = extreme efficiency at scale.
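Top-K routing itself is simple to sketch. The example below uses a random router matrix for illustration; in the real model the router is a learned linear layer, and expert/head dimensions are assumptions:

```python
import numpy as np

def topk_route(logits, k=8):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    idx = np.argpartition(logits, -k)[-k:]
    w = np.exp(logits[idx] - logits[idx].max())
    return idx, w / w.sum()

num_experts, d = 64, 16
rng = np.random.default_rng(0)
router = rng.standard_normal((d, num_experts))  # stand-in for a learned router
token = rng.standard_normal(d)

experts, weights = topk_route(token @ router, k=8)
print(len(experts))  # 8 experts activated out of 64
```

Only the 8 selected experts run their FFNs for this token; the other 56 cost nothing, which is how a 397B-parameter model activates just 17B per token.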


4. Model Comparison

Model               Total Params   Active Params   Architecture
Qwen3.5‑0.8B        0.8B           0.8B            Dense
Qwen3.5‑35B‑A3B     35B            3B              Sparse MoE
Qwen3.5‑122B‑A10B   122B           10B             Sparse MoE
Qwen3.5‑397B‑A17B   397B           17B             Sparse MoE

5. Why Hybrid Beats Pure Linear

Pure linear models (e.g., some Mamba variants) are fast but can degrade on:

  • Needle‑in‑haystack retrieval
  • Fine-grained long‑context QA
  • Complex reasoning

Qwen3.5’s periodic full attention preserves precision while staying near-linear in cost.


6. Deployment: Convert & Run GGUF with llama.cpp

⚠️ Official llama.cpp does NOT support Qwen3.5. Use the fork by tekintian that adds hybrid attention & Gated DeltaNet support: https://github.com/tekintian/llama.cpp

6.1 Steps

  1. Clone the custom fork
  2. Install dependencies
  3. Convert HF model to GGUF
  4. Compile & run inference

6.2 Conversion Command

python convert_hf_to_gguf.py \
  --model your_qwen3.5_model_dir \
  --outfile qwen3.5-f16.gguf \
  --outtype f16

6.3 Inference Example

./llama-cli -m qwen3.5-f16.gguf -p "Your prompt" -c 32768

7. Key Takeaways

  • Qwen3.5 uses 3:1 hybrid attention (Gated DeltaNet + full attention)
  • Linear layers = speed & memory efficiency
  • Full layers = long‑range accuracy
  • Sparse MoE drastically reduces active compute
  • GGUF + custom llama.cpp = practical deployment
  • Supports up to 262K context on manageable hardware

Tags:

Qwen3.5 LLM HybridAttention GatedDeltaNet MoE SparseMoE LongContext LLMDeployment GGUF llama.cpp AI InferenceOptimization

Full Chinese Version :  https://dev.tekin.cn/blog/qwen3-5-hybrid-attention-gated-deltanet-moe-deployment
