March 4, 2026

How to Deploy Qwen3-VL-2B with vLLM

Younes El Hjouji

Detailed Deployment Instructions

1. Environment Setup

# Create a working directory (choose your preferred location)

# Create virtual environment with Python 3.12
python3.12 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Verify Python version (should be 3.12.x)
python --version

# Upgrade pip
pip install --upgrade pip

# Install vLLM
pip install vllm

2. Launch Deployment

# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate

# Recommended public-facing configuration
vllm serve Qwen/Qwen3-VL-2B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.25 \
  --port 8000

3. Verification

3.1. Check server is running:

curl http://localhost:8000/v1/models

3.2. Test text inference:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-2B-Instruct",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'

3.3. Test video inference:

VIDEO_BASE64=$(base64 -w 0 /path/to/video.mp4)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"Qwen/Qwen3-VL-2B-Instruct\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {
          \"type\": \"video_url\",
          \"video_url\": {
            \"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"
          }
        },
        {
          \"type\": \"text\",
          \"text\": \"Describe this video in detail.\"
        }
      ]
    }],
    \"max_tokens\": 100
  }"

Verified Results:

Deployment succeeded with the 8K / 25% GPU configuration.
Short text inference returned in under a second.
A short video inference request completed in about 6.2 seconds and produced a correct description.

4. Troubleshooting

Issue 1: Not enough free GPU memory

Solution: Lower --gpu-memory-utilization or reduce --max-model-len.

Issue 2: Incorrect video content format

Solution: Use video_url in the request payload, not video.

Issue 3: Slow first load

Solution: This is expected on the first run while weights are downloaded and the model is compiled.

5. Notes

This is the smallest Qwen3-VL model and the easiest one to deploy on smaller GPU budgets.
Native context is much larger, but 8K was the clean verified deployment configuration here.
The model supports text, image, and video inputs through vLLM.

6. References

Model Card: https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct
GitHub: https://github.com/QwenLM/Qwen3-VL
vLLM Documentation: https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-VL.html