
Qwen3.5 Hybrid Attention: Gated DeltaNet + MoE Architecture & Deployment Guide

2026-03-06 · 4 min read

This article provides a technical deep dive into Qwen3.5, Alibaba’s advanced large language model built on a hybrid attention architecture that combines Gated DeltaNet linear attention and full attention in a 3:1 layer ratio. It supports an ultra-long context window of up to 262K tokens while achieving much higher inference efficiency than traditional Transformers.

Abstract

Qwen3.5 is Alibaba’s latest LLM family built around a hybrid attention mechanism that combines Gated DeltaNet linear attention and standard full attention in a 3:1 layer ratio. This design drastically reduces compute and memory usage while supporting up to 262K token context windows.

This article explains:

  • How Gated DeltaNet and hybrid attention work
  • Sparse MoE routing & efficiency gains
  • Practical model conversion to GGUF
  • Deployment using a Qwen3.5‑compatible fork of llama.cpp

1. Background: Why Hybrid Attention?

Standard full self‑attention has complexity O(L²d), which becomes prohibitively expensive for long sequences. Linear attention approaches achieve O(Ld²) complexity but often lose expressiveness.
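To make the asymptotics concrete, here is a back-of-the-envelope comparison (illustrative only: constant factors, heads, and hardware effects are ignored, and d = 128 is an assumed per-head dimension):

```python
# Rough cost comparison between full and linear attention.
# Illustrative arithmetic only; real kernels differ by constant factors.

def full_attention_flops(L: int, d: int) -> int:
    """O(L^2 * d): every token attends to every other token."""
    return L * L * d

def linear_attention_flops(L: int, d: int) -> int:
    """O(L * d^2): a fixed d x d state is updated once per token."""
    return L * d * d

d = 128  # assumed per-head dimension
for L in (4_096, 262_144):
    ratio = full_attention_flops(L, d) / linear_attention_flops(L, d)
    print(f"L={L:>7}: full/linear cost ratio = {ratio:.0f}x")
```

The ratio is simply L/d, so at a 262K context the quadratic term dominates by three orders of magnitude, which is exactly the regime hybrid attention targets.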

Qwen3.5 solves this by:

  • Using Gated DeltaNet for fast, linear-complexity attention
  • Inserting full-attention layers periodically to preserve long-range precision
  • Pairing attention with sparse MoE for conditional compute

The result:

  • Up to 19× higher throughput at 256K context vs. Qwen3-Max
  • Near-linear scaling for long texts
  • Strong performance on reasoning, QA, and code

2. Gated DeltaNet: Core Mechanism

Gated DeltaNet is a linear attention variant that borrows machinery from state space models (SSMs), with two key innovations:

2.1 Gating & Delta Update

The memory state evolves as:

S_t = β_t ⊙ S_{t-1} + Δ_t ⊗ (K_t ⊗ V_t)

  • β_t (gate): controls how much past state is retained
  • Δ_t (delta): enables precise, non-destructive updates
  • No need to cache the full KV history — only a fixed-size state

This gives O(1) state memory and O(L) compute.
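The update rule above can be sketched as a toy recurrence. This is a minimal illustration, not Qwen3.5's actual parameterization: the state is a d_k × d_v matrix, and beta/delta are taken as scalars here for clarity.

```python
import numpy as np

def gated_delta_step(S, k, v, beta, delta):
    """One recurrent step: decay the old state, write a rank-1 update.

    S     : (d_k, d_v) fixed-size memory state
    k, v  : (d_k,), (d_v,) key/value for the current token
    beta  : gate in [0, 1], how much past state survives
    delta : step size for the new key-value association
    """
    return beta * S + delta * np.outer(k, v)

d_k, d_v, L = 4, 4, 16
S = np.zeros((d_k, d_v))
rng = np.random.default_rng(0)
for _ in range(L):
    k = rng.standard_normal(d_k)
    v = rng.standard_normal(d_v)
    S = gated_delta_step(S, k, v, beta=0.9, delta=0.1)

# Memory stays O(1) in sequence length: S is always d_k x d_v,
# no matter how many tokens have been processed.
print(S.shape)  # (4, 4)
```

Note how the loop runs L steps but the state never grows: that fixed-size S is what replaces the ever-growing KV cache of full attention.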

2.2 SSM Components

Each linear layer includes:

  • 1D convolution for local structure
  • State matrix A_log stored in log space for stability
  • dt_proj for time-step gating
  • Skip connection D_proj for training stability
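A rough sketch of how these components might fit together in one linear layer follows. Shapes, initialization, and the exact update are assumptions for illustration, not Qwen3.5's implementation:

```python
import numpy as np

def causal_conv1d(x, w):
    """Depthwise causal 1D conv for local structure. x: (L, d), w: (k, d)."""
    k = w.shape[0]
    padded = np.vstack([np.zeros((k - 1, x.shape[1])), x])
    return np.stack([(padded[t:t + k] * w).sum(axis=0)
                     for t in range(x.shape[0])])

L, d = 8, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((L, d))

A_log = rng.standard_normal(d)                          # state matrix in log space
dt = np.logaddexp(0, x @ rng.standard_normal((d, d)))   # softplus time-step gate
decay = np.exp(-np.exp(A_log) * dt)                     # per-channel decay in (0, 1)
D = rng.standard_normal(d)                              # skip-connection weights

h = causal_conv1d(x, rng.standard_normal((3, d)))
s = np.zeros(d)
out = np.empty_like(x)
for t in range(L):
    s = decay[t] * s + h[t]   # gated recurrent state update
    out[t] = s + D * x[t]     # skip path aids training stability
print(out.shape)  # (8, 4)
```

Keeping A in log space guarantees exp(A_log) > 0, so the decay always lands in (0, 1) and the recurrence cannot blow up.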

3. Qwen3.5 Hybrid Attention Architecture

3.1 Layer Mixing Strategy

A typical 24‑layer configuration uses:

  • 3× linear attention
  • 1× full attention
  • Repeat across the stack

This keeps ~75% of layers lightweight while preserving full-attention quality.

Example config snippet:

"layer_types": [
  "linear_attention",
  "linear_attention",
  "linear_attention",
  "full_attention",
  ...
],
"full_attention_interval": 4

3.2 Sparse MoE Design

Qwen3.5 uses high-sparsity MoE:

  • Total params: up to 397B
  • Activated per token: only 17B
  • Top‑K routing (typically top‑8 from 64 experts)
  • Shared experts for stability

MoE + hybrid attention = extreme efficiency at scale.
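Top-K routing itself is simple to sketch. The example below uses a random router matrix for illustration; in the real model the router is a learned linear layer, and expert/head dimensions are assumptions:

```python
import numpy as np

def topk_route(logits, k=8):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    idx = np.argpartition(logits, -k)[-k:]
    w = np.exp(logits[idx] - logits[idx].max())
    return idx, w / w.sum()

num_experts, d = 64, 16
rng = np.random.default_rng(0)
router = rng.standard_normal((d, num_experts))  # stand-in for a learned router
token = rng.standard_normal(d)

experts, weights = topk_route(token @ router, k=8)
print(len(experts))  # 8 experts activated out of 64
```

Only the 8 selected experts run their FFNs for this token; the other 56 cost nothing, which is how a 397B-parameter model activates just 17B per token.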


4. Model Comparison

Model               Total Params   Active Params   Architecture
Qwen3.5‑0.8B        0.8B           0.8B            Dense
Qwen3.5‑35B‑A3B     35B            3B              Sparse MoE
Qwen3.5‑122B‑A10B   122B           10B             Sparse MoE
Qwen3.5‑397B‑A17B   397B           17B             Sparse MoE

5. Why Hybrid Beats Pure Linear

Pure linear models (e.g., some Mamba variants) are fast but can degrade on:

  • Needle‑in‑haystack retrieval
  • Fine-grained long‑context QA
  • Complex reasoning

Qwen3.5’s periodic full attention preserves precision while staying near-linear in cost.


6. Deployment: Convert & Run GGUF with llama.cpp

⚠️ Official llama.cpp does NOT support Qwen3.5. Use the fork by tekintian that adds hybrid attention & Gated DeltaNet support: https://github.com/tekintian/llama.cpp

6.1 Steps

  1. Clone the custom fork
  2. Install dependencies
  3. Convert HF model to GGUF
  4. Compile & run inference

6.2 Conversion Command

python convert_hf_to_gguf.py \
  --model your_qwen3.5_model_dir \
  --outfile qwen3.5-f16.gguf \
  --outtype f16

6.3 Inference Example

./llama-cli -m qwen3.5-f16.gguf -p "Your prompt" -c 32768

7. Key Takeaways

  • Qwen3.5 uses 3:1 hybrid attention (Gated DeltaNet + full attention)
  • Linear layers = speed & memory efficiency
  • Full layers = long‑range accuracy
  • Sparse MoE drastically reduces active compute
  • GGUF + custom llama.cpp = practical deployment
  • Supports up to 262K context on manageable hardware

Tags:

Qwen3.5 LLM HybridAttention GatedDeltaNet MoE SparseMoE LongContext LLMDeployment GGUF llama.cpp AI InferenceOptimization

Full Chinese Version :  https://dev.tekin.cn/blog/qwen3-5-hybrid-attention-gated-deltanet-moe-deployment
