How to Deploy Qwen3.5-35B-A3B with vLLM
Younes El Hjouji
Detailed Deployment Instructions
1. Environment Setup
# Create a working directory (choose your preferred location)
# Create virtual environment with Python 3.12
python3.12 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Verify Python version (should be 3.12.x)
python --version
# Upgrade pip
pip install --upgrade pip
# Install vLLM nightly (REQUIRED — Qwen3.5 is not yet in stable vLLM)
# Option A: Using uv (faster)
pip install uv
uv pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
# Option B: Using pip directly
pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
# Install video support
pip install 'vllm[video]'
2. Launch Deployment
# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate
# Launch vLLM serve
vllm serve Qwen/Qwen3.5-35B-A3B \
--dtype bfloat16 \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.90 \
--port 8000
Note: First model download takes ~5-10 minutes. First inference request triggers CUDA graph compilation which takes ~4-5 minutes. Subsequent requests are fast (~0.1s for text).
3. Verification
3.1. Check server is running:
curl http://localhost:8000/v1/models
3.2. Test text inference:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3.5-35B-A3B",
"prompt": "Hello, how are you?",
"max_tokens": 50
}'
3.3. Test video inference:
VIDEO_BASE64=$(base64 -w 0 /path/to/your/video.mp4)
time curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"Qwen/Qwen3.5-35B-A3B\",
\"messages\": [{
\"role\": \"user\",
\"content\": [
{
\"type\": \"video_url\",
\"video_url\": {
\"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"
}
},
{
\"type\": \"text\",
\"text\": \"Describe the user actions in this video in a sequential list.\"
}
]
}],
\"max_tokens\": 1024
}"
Verified Results:
- Text inference: Working. Response in ~0.14s (warm).
- Video inference: Working. video.mp4 (6s clip) processed in ~9.7s, 1024 tokens generated.
- Video output correctly identified soccer player (Wayne Rooney), actions (clapping, shouting), setting (green field), and temporal sequence.
4. Troubleshooting
Issue 1: First request takes several minutes
- Solution: This is expected. CUDA graph compilation happens on the first request. Subsequent requests are fast.
Issue 2: Requires vLLM nightly
- Solution: Qwen3.5 architecture (
qwen3_5_moe) is only supported in vLLM nightly builds. Install fromhttps://wheels.vllm.ai/nightly.
5. Notes
- This is a MoE model: 35B total parameters but only 3B active per token. Despite the large total parameter count, inference is efficient.
- GPU memory is high (~131GB) because all 256 expert weights must be loaded, even though only 8+1 are active per token.
- The model uses a novel hybrid attention mechanism (Gated DeltaNet + Gated Attention) which provides faster inference on long contexts compared to standard Transformer attention.
- The model naturally produces
<think>reasoning traces in completions. This is expected behavior. - Chunked prefill is automatically enabled with max_num_batched_tokens=16384.