How to Deploy Qwen3-VL-32B with vLLM
Younes El Hjouji
Detailed Deployment Instructions
1. Environment Setup
# Create a working directory (choose your preferred location)
# Create virtual environment with Python 3.12
python3.12 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Verify Python version (should be 3.12.x)
python --version
# Upgrade pip
pip install --upgrade pip
# Install vLLM
pip install vllm
2. Launch Deployment
# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate
# Recommended public-facing configuration
vllm serve Qwen/Qwen3-VL-32B-Instruct \
--dtype bfloat16 \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.75 \
--port 8000
3. Verification
3.1. Check server is running:
curl http://localhost:8000/v1/models
3.2. Test text inference:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-VL-32B-Instruct",
"prompt": "Hello, how are you?",
"max_tokens": 50
}'
3.3. Test video inference:
VIDEO_BASE64=$(base64 -w 0 /path/to/video.mp4)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"Qwen/Qwen3-VL-32B-Instruct\",
\"messages\": [{
\"role\": \"user\",
\"content\": [
{
\"type\": \"video_url\",
\"video_url\": {
\"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"
}
},
{
\"type\": \"text\",
\"text\": \"Describe this video in detail.\"
}
]
}],
\"max_tokens\": 150
}"
Verified Results:
- Deployment succeeded with the 8K / 75% GPU configuration.
- Text inference returned in about 0.99 seconds for a short prompt.
- A short video inference request completed in about 9.2 seconds and produced a detailed description.
4. Troubleshooting
Issue 1: Not enough free GPU memory
- Solution: Lower
--gpu-memory-utilizationor reduce--max-model-len.
Issue 2: KV cache is too small for the requested context
- Solution: Reduce
--max-model-lenuntil the requested context fits the available GPU memory.
Issue 3: Slow first load
- Solution: This is expected. The initial run includes checkpoint loading, torch compilation, and CUDA graph capture.
5. Notes
- This is the dense Qwen3-VL 32B model, so all parameters are active on every token.
- The tested deployment used about 109 GB at the recommended 8K setting.
- Native context is much larger, but 8K was the clean verified deployment configuration here.
- This model supports text, image, and video inputs through vLLM.
6. References
- Model Card: https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct
- GitHub: https://github.com/QwenLM/Qwen3-VL
- vLLM Documentation: https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-VL.html