
How to Deploy Qwen3-VL-4B with vLLM

Younes El Hjouji

Detailed Deployment Instructions

1. Environment Setup

# Create a working directory (choose your preferred location)

# Create virtual environment with Python 3.12
# Make sure you have Python 3.12 installed: sudo apt install python3.12 python3.12-venv
python3.12 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Verify Python version (should be 3.12.x)
python --version

# Upgrade pip
pip install --upgrade pip

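# Install vLLM with video support (this takes ~5-10 minutes).
# The [video] extra installs vLLM itself plus its video dependencies,
# so a separate "pip install vllm" is not needed.
pip install 'vllm[video]'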

2. Launch Deployment

# Make sure you're in your working directory with the virtual environment activated
source venv/bin/activate

# Launch vLLM serve (maximum context for solo deployment)
vllm serve Qwen/Qwen3-VL-4B-Instruct \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.9 \
  --port 8000
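Because the first startup can take a few minutes, it can help to wait for the server before running the checks in the next section. The following is a minimal sketch, not part of vLLM: wait_for_server is a hypothetical helper name, and the 300-second default timeout and 2-second polling interval are assumptions. It polls the same /v1/models endpoint used in the verification step.

```shell
# Hypothetical helper (not part of vLLM): poll a URL until it answers with
# an HTTP success status, or give up after a timeout given in seconds.
wait_for_server() {
  local url="${1:-http://localhost:8000/v1/models}"
  local timeout="${2:-300}"   # assumed default; adjust to your hardware
  local waited=0
  # -f treats HTTP errors as failures; -s and -o /dev/null hide the output.
  until curl -sf --max-time 5 -o /dev/null "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Server not ready after ${timeout}s: $url" >&2
      return 1
    fi
    sleep 2
    waited=$((waited + 2))
  done
  echo "Server ready: $url"
}
```

Since vllm serve runs in the foreground, call wait_for_server from a second terminal (or start the server in the background first) and continue only once it returns successfully.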

3. Verification

3.1. Check server is running:

curl http://localhost:8000/v1/models

Expected output (the "created" timestamp will differ):

{
  "object": "list",
  "data": [{
    "id": "Qwen/Qwen3-VL-4B-Instruct",
    "object": "model",
    "created": 1769822588,
    "owned_by": "vllm",
    "root": "Qwen/Qwen3-VL-4B-Instruct",
    "parent": null,
    "max_model_len": 65536
  }]
}

3.2. Test text inference:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-4B-Instruct",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'
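The same prompt can also be sent to /v1/chat/completions, the endpoint the video test below uses. The sketch below builds the request body with Python's json module so that quotes in the prompt are escaped correctly. chat_payload is a hypothetical helper name, and max_tokens is carried over from the request above.

```shell
# Hypothetical helper: print a chat-completions request body for one
# user prompt. json.dumps handles quoting and escaping of the prompt text.
chat_payload() {
  python3 -c 'import json, sys; print(json.dumps({"model": sys.argv[1], "messages": [{"role": "user", "content": sys.argv[2]}], "max_tokens": 50}))' \
    "Qwen/Qwen3-VL-4B-Instruct" "$1"
}

# Usage (requires the server to be running):
#   chat_payload "Hello, how are you?" | curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @-
```

The -d @- form tells curl to read the request body from standard input.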

3.3. Test video inference:

# Encode video to base64 on a single line (-w 0 disables line wrapping in GNU base64)
VIDEO_BASE64=$(base64 -w 0 /path/to/your/video.mp4)

# Create request file
cat > /tmp/video_request.json << EOF
{
  "model": "Qwen/Qwen3-VL-4B-Instruct",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "video_url",
        "video_url": {
          "url": "data:video/mp4;base64,$VIDEO_BASE64"
        }
      },
      {
        "type": "text",
        "text": "Describe the user actions in this video in a sequential list."
      }
    ]
  }],
  "max_tokens": 1024
}
EOF

# Time the request
time curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @/tmp/video_request.json

Example output:

A man in a red jersey is standing on a grassy field. He is clapping his hands
together and appears to be speaking or shouting. He turns his head to the left,
still clapping, and then turns back to face forward, continuing to clap.

Video inference latency: 6.538 s (the "real" wall-clock time reported by time)
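The server returns a JSON object, and the description shown above is the text in choices[0].message.content. A small filter can print just that field. This is a sketch with a hypothetical helper name (extract_content); it assumes a successful, non-streaming response.

```shell
# Hypothetical helper: read a chat-completions response on stdin and print
# only the generated text from the first choice.
extract_content() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage (requires the server to be running):
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d @/tmp/video_request.json | extract_content
```

If the server returns an error object instead, the filter fails with a KeyError, which is a reasonable signal to inspect the raw response.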

4. Troubleshooting

Issue 1: ModuleNotFoundError for video processing

  • Symptom: "ModuleNotFoundError: No module named 'cv2'" (cv2 is the import name of OpenCV)
  • Solution: Install video support with pip install 'vllm[video]'
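To see which module is missing before reinstalling, the active environment can be probed directly. This is a sketch with hypothetical helper names (check_module, report_modules). It assumes that cv2, the import name of the OpenCV Python package, is the module the video path needs.

```shell
# Hypothetical helper: succeed if the named Python module imports cleanly.
check_module() {
  python3 -c 'import importlib, sys; importlib.import_module(sys.argv[1])' "$1" 2>/dev/null
}

# Hypothetical helper: print one status line per module name given.
report_modules() {
  local m
  for m in "$@"; do
    if check_module "$m"; then echo "$m: ok"; else echo "$m: MISSING"; fi
  done
}

# Usage (inside the activated virtual environment):
#   report_modules vllm cv2
```

Run it with the virtual environment activated, otherwise python3 resolves to the system interpreter and the report is misleading.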

Issue 2: Slow compilation on first run

  • Symptom: Server takes 2-3 minutes to fully start
  • Solution: This is normal. On the first run, Torch compilation builds and caches the model graphs; subsequent startups reuse the cache and are faster.

5. Notes

  • Context length: The model natively supports 256K tokens (extendable to 1M). This deployment caps the context at 65,536 tokens (64K) via --max-model-len.
  • Multimodal support: Supports both image and video inputs
  • Vision backbone: Standard Qwen3-VL vision encoder
  • LLM backbone: Qwen3-4B
  • Flash Attention: Automatically enabled for vision encoder
  • Torch Compilation: Enabled for backbone with Inductor backend
  • Chunked prefill: Enabled for efficient batch processing
  • Prefix caching: Enabled by default

6. Model Specifications

  • Model size: 4B parameters
  • Architecture: Qwen3-VL (vision encoder + Qwen3-4B language model)
  • Context: 256K tokens native (up to 1M possible)
  • License: Apache-2.0
  • Date published: 2025-10-15
  • Video capability: Full video understanding support

7. Performance Benchmarks

Text inference:

  • Latency: < 1 second for 50 tokens
  • Quality: Comparable to larger models for most tasks

Video inference:

  • Latency: 6.538 s for an 84 KB video (video.mp4)
  • Quality: Accurate scene and action description
  • Memory efficient: Lower memory than 8B variants
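The latency figures above come from single requests, which can vary from run to run. Averaging a few repetitions gives a steadier number. The sketch below uses curl's time_total write-out variable and a hypothetical mean_latency helper; the choice of five repetitions is an assumption.

```shell
# Hypothetical helper: read one latency in seconds per line on stdin and
# print the arithmetic mean with millisecond precision.
mean_latency() {
  awk '{ sum += $1; n++ } END { if (n) printf "%.3f\n", sum / n }'
}

# Usage (requires the server to be running and /tmp/video_request.json to exist):
#   for i in 1 2 3 4 5; do
#     curl -s -o /dev/null -w '%{time_total}\n' \
#       http://localhost:8000/v1/chat/completions \
#       -H "Content-Type: application/json" \
#       -d @/tmp/video_request.json
#   done | mean_latency
```

With prefix caching enabled, repeated identical requests may be faster than the first, so the average can understate cold-request latency.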

8. References