March 4, 2026

How to Deploy Qwen3-VL-32B with vLLM

Younes El Hjouji

Detailed Deployment Instructions

1. Environment Setup

# Create a working directory (choose your preferred location)

# Create virtual environment with Python 3.12
python3.12 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Verify Python version (should be 3.12.x)
python --version

# Upgrade pip
pip install --upgrade pip

# Install vLLM
pip install vllm

2. Launch Deployment

# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate

# Recommended public-facing configuration
vllm serve Qwen/Qwen3-VL-32B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.75 \
  --port 8000

3. Verification

3.1. Check server is running:

curl http://localhost:8000/v1/models

3.2. Test text inference:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-32B-Instruct",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'

3.3. Test video inference:

VIDEO_BASE64=$(base64 -w 0 /path/to/video.mp4)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"Qwen/Qwen3-VL-32B-Instruct\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {
          \"type\": \"video_url\",
          \"video_url\": {
            \"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"
          }
        },
        {
          \"type\": \"text\",
          \"text\": \"Describe this video in detail.\"
        }
      ]
    }],
    \"max_tokens\": 150
  }"

Verified Results:

Deployment succeeded with the 8K / 75% GPU configuration.
Text inference returned in about 0.99 seconds for a short prompt.
A short video inference request completed in about 9.2 seconds and produced a detailed description.

4. Troubleshooting

Issue 1: Not enough free GPU memory

Solution: Lower --gpu-memory-utilization or reduce --max-model-len.

Issue 2: KV cache is too small for the requested context

Solution: Reduce --max-model-len until the requested context fits the available GPU memory.

Issue 3: Slow first load

Solution: This is expected. The initial run includes checkpoint loading, torch compilation, and CUDA graph capture.

5. Notes

This is the dense Qwen3-VL 32B model, so all parameters are active on every token.
The tested deployment used about 109 GB at the recommended 8K setting.
Native context is much larger, but 8K was the clean verified deployment configuration here.
This model supports text, image, and video inputs through vLLM.

6. References

Model Card: https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct
GitHub: https://github.com/QwenLM/Qwen3-VL
vLLM Documentation: https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-VL.html