March 4, 2026

How to Deploy Qwen3-VL-4B with vLLM

Younes El Hjouji

Detailed Deployment Instructions

1. Environment Setup

# Create a working directory and move into it (the name is only an example; choose your preferred location)
mkdir -p ~/qwen3-vl && cd ~/qwen3-vl

# Create virtual environment with Python 3.12
# Make sure you have Python 3.12 installed: sudo apt install python3.12 python3.12-venv
python3.12 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Verify Python version (should be 3.12.x)
python --version

# Upgrade pip
pip install --upgrade pip

# Install vLLM with video support (this takes ~5-10 minutes)
# The [video] extra installs vLLM itself, so a separate "pip install vllm" is not needed
pip install 'vllm[video]'
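
Since the references at the end of this guide list 0.11.0 as the minimum vLLM version, a quick check after installation can catch an outdated install. The following is a minimal sketch rather than part of the original setup: version_ge and check_vllm_version are hypothetical helper names, the comparison assumes plain x.y.z version strings, and it relies on a sort that supports -V (GNU coreutils does).

```shell
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: compare version $1 against the documented minimum.
check_vllm_version() {
  if version_ge "$1" "0.11.0"; then
    echo "vLLM $1 is new enough"
  else
    echo "vLLM $1 is older than 0.11.0" >&2
    return 1
  fi
}
```

Inside the activated virtual environment it could be called as: check_vllm_version "$(python -c 'import vllm; print(vllm.__version__)')"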

2. Launch Deployment

# From your working directory, activate the virtual environment
source venv/bin/activate

# Launch the vLLM server (65,536-token context, the maximum used for solo deployment)
vllm serve Qwen/Qwen3-VL-4B-Instruct \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.9 \
  --port 8000
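
Because vllm serve stays in the foreground and needs some time before it accepts requests, it can help to wait for it from a second terminal before running the verification steps. The sketch below polls the same /v1/models endpoint used in the next section; wait_for_server is a hypothetical helper name, and the 300-second default timeout and 2-second polling interval are arbitrary assumptions.

```shell
# Hypothetical helper: poll a URL until it answers or a timeout expires.
# $1 = URL to probe (default: the models endpoint on port 8000)
# $2 = timeout in seconds (default: 300, an arbitrary choice)
wait_for_server() {
  wfs_url="${1:-http://localhost:8000/v1/models}"
  wfs_timeout="${2:-300}"
  wfs_start=$(date +%s)
  # curl -f turns HTTP error codes into a non-zero exit status
  until curl -sf -o /dev/null "$wfs_url"; do
    if [ $(( $(date +%s) - wfs_start )) -ge "$wfs_timeout" ]; then
      echo "Server not ready after ${wfs_timeout}s: $wfs_url" >&2
      return 1
    fi
    sleep 2
  done
  echo "Server is ready: $wfs_url"
}
```

A typical call would be: wait_for_server http://localhost:8000/v1/models 600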

3. Verification

3.1. Check that the server is running:

curl http://localhost:8000/v1/models

Expected output (the "created" timestamp will differ):

{
  "object": "list",
  "data": [{
    "id": "Qwen/Qwen3-VL-4B-Instruct",
    "object": "model",
    "created": 1769822588,
    "owned_by": "vllm",
    "root": "Qwen/Qwen3-VL-4B-Instruct",
    "parent": null,
    "max_model_len": 65536
  }]
}
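
To confirm from a script that the server picked up the intended context length, the max_model_len field can be pulled out of this response. The following is a rough sketch using only grep, so it is a text match rather than real JSON parsing; served_max_len and check_max_len are hypothetical helper names.

```shell
# Hypothetical helper: read a /v1/models response on stdin and print the
# first max_model_len value. A crude text match, not a JSON parser.
served_max_len() {
  grep -o '"max_model_len": *[0-9][0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Hypothetical helper: succeed only when the served limit on stdin equals $1.
check_max_len() {
  cml_actual=$(served_max_len)
  [ "$cml_actual" = "$1" ]
}
```

For example: curl -s http://localhost:8000/v1/models | check_max_len 65536 && echo "context length OK"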

3.2. Test text inference:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-4B-Instruct",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'

3.3. Test video inference:

# Encode video to base64 on a single line (-w 0 disables line wrapping; GNU coreutils option)
VIDEO_BASE64=$(base64 -w 0 /path/to/your/video.mp4)

# Create request file
cat > /tmp/video_request.json << EOF
{
  "model": "Qwen/Qwen3-VL-4B-Instruct",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "video_url",
        "video_url": {
          "url": "data:video/mp4;base64,$VIDEO_BASE64"
        }
      },
      {
        "type": "text",
        "text": "Describe the user actions in this video in a sequential list."
      }
    ]
  }],
  "max_tokens": 1024
}
EOF

# Time the request
time curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @/tmp/video_request.json

Example output (the message content from the JSON response):

A man in a red jersey is standing on a grassy field. He is clapping his hands
together and appears to be speaking or shouting. He turns his head to the left,
still clapping, and then turns back to face forward, continuing to clap.

Video inference latency: 6.538s (the wall-clock "real" value reported by time)
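
Because the video in 3.3 travels inside the JSON body as a base64 data URL, the request is about a third larger than the file itself: base64 emits 4 characters for every 3 input bytes and pads the last group. The sketch below works through that arithmetic. b64_size is a hypothetical helper name, and the example assumes that an "84 KB" file means 84 * 1024 bytes.

```shell
# Hypothetical helper: number of base64 characters produced for $1 input bytes.
# Base64 maps each 3-byte group (the last one padded) to 4 characters.
b64_size() {
  echo $(( ( ($1 + 2) / 3 ) * 4 ))
}

# Assuming 84 * 1024 = 86016 bytes, the encoded payload is 114688 characters.
b64_size 86016
```

For a real file, the byte count can come from wc: b64_size "$(wc -c < /path/to/your/video.mp4)"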

4. Troubleshooting

Issue 1: ModuleNotFoundError for video processing

Symptom: "ModuleNotFoundError: No module named 'cv2'" (cv2 is the import name of OpenCV)
Solution: Install video support with pip install 'vllm[video]'
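
Before launching the server, it is possible to check whether the video dependency is already importable. OpenCV is imported in Python under the name cv2, so that is the module to probe. This is a minimal sketch: has_module is a hypothetical helper name, and the PYTHON variable with its python3 default is an assumption made for illustration.

```shell
# Hypothetical helper: succeed when Python module $1 can be located.
# PYTHON may be set to choose the interpreter (default: python3).
has_module() {
  "${PYTHON:-python3}" -c '
import importlib.util, sys
sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)
' "$1"
}
```

Inside the activated virtual environment, one way to use it is: has_module cv2 || pip install 'vllm[video]'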

Issue 2: Slow compilation on first run

Symptom: The server takes 2-3 minutes to start fully
Solution: This is expected. On the first launch, Torch compilation builds and caches the compiled graphs; later startups reuse the cache and are faster.

5. Notes

Context length: The model natively supports 256K tokens (extendable to 1M). This solo deployment is capped at 65,536 tokens (64K) via --max-model-len.
Multimodal support: Image and video inputs
Vision backbone: Standard Qwen3-VL vision encoder
LLM backbone: Qwen3-4B
Flash Attention: Automatically enabled for vision encoder
Torch Compilation: Enabled for backbone with Inductor backend
Chunked prefill: Enabled for efficient batch processing
Prefix caching: Enabled by default

6. Model Specifications

Model size: 4B parameters
Architecture: Qwen3-VL (vision encoder + Qwen3-4B language model)
Context: 256K tokens native (up to 1M possible)
License: Apache-2.0
Date published: 2025-10-15
Video capability: Full video understanding support

7. Performance Benchmarks

Text inference:

Latency: < 1 second for 50 tokens
Quality: Comparable to larger models for most tasks

Video inference:

Latency: 6.538s for an 84 KB video (video.mp4)
Quality: Accurate scene and action description
Memory: Lower footprint than the 8B variants
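
The latency figures above can be re-measured on other hardware with curl's own timer instead of the shell's time command. The sketch below posts a JSON request file and prints curl's time_total value in seconds; request_latency is a hypothetical helper name. Note that time_total covers only the HTTP transfer, so it will read slightly lower than the "real" value reported by time.

```shell
# Hypothetical helper: POST a JSON file and print curl's total request time
# in seconds. $1 = endpoint URL, $2 = path to the JSON request body.
# Returns curl's exit status, so a failed request is reported as non-zero.
request_latency() {
  curl -sf -o /dev/null -w '%{time_total}\n' \
    -H "Content-Type: application/json" \
    -d @"$2" "$1"
}
```

With the request file from section 3.3, this would be: request_latency http://localhost:8000/v1/chat/completions /tmp/video_request.json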

8. References

Model Card: https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct
Paper: https://huggingface.co/papers/2511.21631
GitHub: https://github.com/QwenLM/Qwen3-VL
vLLM Documentation: https://docs.vllm.ai/
Minimum vLLM version: 0.11.0 (this deployment used 0.15.0)