March 4, 2026

How to Deploy Qwen2.5-VL-3B with vLLM

Younes El Hjouji

Detailed Deployment Instructions

1. Environment Setup

# Create a working directory (choose your preferred location)

# Create virtual environment with Python 3.12
# Make sure you have Python 3.12 installed: sudo apt install python3.12 python3.12-venv
python3.12 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Verify Python version (should be 3.12.x)
python --version

# Upgrade pip
pip install --upgrade pip

# Install vLLM (this takes ~5-10 minutes)
pip install vllm

2. Launch Deployment

# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate

# Launch vLLM serve (full native context)
vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.9 \
  --port 8000

3. Verification

3.1. Check server is running:

curl http://localhost:8000/v1/models

Expected output:

{
  "object": "list",
  "data": [{
    "id": "Qwen/Qwen2.5-VL-3B-Instruct",
    "object": "model",
    "created": 1770105752,
    "owned_by": "vllm",
    "root": "Qwen/Qwen2.5-VL-3B-Instruct",
    "parent": null,
    "max_model_len": 32768
  }]
}

3.2. Test text inference:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'

3.3. Test video inference:

# Encode video to base64
VIDEO_BASE64=$(base64 -w 0 /path/to/your/video.mp4)

# Time the request and measure latency
time curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"Qwen/Qwen2.5-VL-3B-Instruct\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {
          \"type\": \"video_url\",
          \"video_url\": {
            \"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"
          }
        },
        {
          \"type\": \"text\",
          \"text\": \"Describe the user actions in this video in a sequential list.\"
        }
      ]
    }],
    \"max_tokens\": 1024
  }"

Example output:

1. A soccer player in a red jersey is standing on the field, clapping his hands together.
2. The player continues to clap his hands, looking around the field.
3. The player then turns his head to the side and looks at something off-screen.

Video inference latency: 4.38s (real time)

4. Troubleshooting

Issue 1: Out of memory with full context

Symptom: CUDA out of memory when using 32768 context or when another model is running
Solution: Reduce --gpu-memory-utilization to 0.5 or stop other running models

Issue 2: Slow compilation on first run

Symptom: Server takes 2-3 minutes to fully start
Solution: This is normal. Torch compilation caches graphs for future use. Subsequent startups will be faster.

Issue 3: Model not found

Symptom: Error loading model or model not recognized
Solution: vLLM 0.15.0 supports Qwen2.5-VL models. Ensure you have the latest version.

5. Notes

Context length: Model natively supports 32K tokens with YaRN support for longer sequences
Multimodal support: Supports both image and video inputs with dynamic resolution
Video handling: Can handle videos over 1 hour with dynamic FPS sampling
Vision encoder: Streamlined ViT with window attention
LLM backbone: Qwen2.5-3B
Flash Attention: Automatically enabled for vision encoder (AttentionBackendEnum.FLASH_ATTN)
Torch Compilation: Enabled for backbone with Inductor backend for better performance
Dynamic resolution: Supports custom min_pixels/max_pixels for memory optimization
mRoPE: Uses modified RoPE for temporal dimension in videos

6. Model Specifications

Model size: 4B parameters
Architecture: Qwen2.5-VL (Streamlined ViT + Qwen2.5-3B)
Context window: 32,768 tokens (native)
Video support: Dynamic FPS with mRoPE for temporal encoding
License: Apache-2.0 (check model card for latest)
Quantization support: 68 quantized variants available

7. References

Model Card: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
Paper: https://huggingface.co/papers/2502.13923
GitHub: https://github.com/QwenLM/Qwen2.5-VL
vLLM Support: Yes (deployed with 0.15.0)
SGLang Support: Yes