How to Deploy Qwen2.5-VL-3B with vLLM
Younes El Hjouji
Detailed Deployment Instructions
1. Environment Setup
# Create a working directory (choose your preferred location)
# Create virtual environment with Python 3.12
# Make sure you have Python 3.12 installed: sudo apt install python3.12 python3.12-venv
python3.12 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Verify Python version (should be 3.12.x)
python --version
# Upgrade pip
pip install --upgrade pip
# Install vLLM (this takes ~5-10 minutes)
pip install vllm
2. Launch Deployment
# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate
# Launch vLLM serve (full native context)
vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
--dtype bfloat16 \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.9 \
--port 8000
3. Verification
3.1. Check server is running:
curl http://localhost:8000/v1/models
Expected output:
{
"object": "list",
"data": [{
"id": "Qwen/Qwen2.5-VL-3B-Instruct",
"object": "model",
"created": 1770105752,
"owned_by": "vllm",
"root": "Qwen/Qwen2.5-VL-3B-Instruct",
"parent": null,
"max_model_len": 32768
}]
}
3.2. Test text inference:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-VL-3B-Instruct",
"prompt": "Hello, how are you?",
"max_tokens": 50
}'
3.3. Test video inference:
# Encode video to base64
VIDEO_BASE64=$(base64 -w 0 /path/to/your/video.mp4)
# Time the request and measure latency
time curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"Qwen/Qwen2.5-VL-3B-Instruct\",
\"messages\": [{
\"role\": \"user\",
\"content\": [
{
\"type\": \"video_url\",
\"video_url\": {
\"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"
}
},
{
\"type\": \"text\",
\"text\": \"Describe the user actions in this video in a sequential list.\"
}
]
}],
\"max_tokens\": 1024
}"
Example output:
1. A soccer player in a red jersey is standing on the field, clapping his hands together.
2. The player continues to clap his hands, looking around the field.
3. The player then turns his head to the side and looks at something off-screen.
Video inference latency: 4.38s (real time)
4. Troubleshooting
Issue 1: Out of memory with full context
- Symptom: CUDA out of memory when using 32768 context or when another model is running
- Solution: Reduce
--gpu-memory-utilizationto 0.5 or stop other running models
Issue 2: Slow compilation on first run
- Symptom: Server takes 2-3 minutes to fully start
- Solution: This is normal. Torch compilation caches graphs for future use. Subsequent startups will be faster.
Issue 3: Model not found
- Symptom: Error loading model or model not recognized
- Solution: vLLM 0.15.0 supports Qwen2.5-VL models. Ensure you have the latest version.
5. Notes
- Context length: Model natively supports 32K tokens with YaRN support for longer sequences
- Multimodal support: Supports both image and video inputs with dynamic resolution
- Video handling: Can handle videos over 1 hour with dynamic FPS sampling
- Vision encoder: Streamlined ViT with window attention
- LLM backbone: Qwen2.5-3B
- Flash Attention: Automatically enabled for vision encoder (AttentionBackendEnum.FLASH_ATTN)
- Torch Compilation: Enabled for backbone with Inductor backend for better performance
- Dynamic resolution: Supports custom min_pixels/max_pixels for memory optimization
- mRoPE: Uses modified RoPE for temporal dimension in videos
6. Model Specifications
- Model size: 4B parameters
- Architecture: Qwen2.5-VL (Streamlined ViT + Qwen2.5-3B)
- Context window: 32,768 tokens (native)
- Video support: Dynamic FPS with mRoPE for temporal encoding
- License: Apache-2.0 (check model card for latest)
- Quantization support: 68 quantized variants available
7. References
- Model Card: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct
- Paper: https://huggingface.co/papers/2502.13923
- GitHub: https://github.com/QwenLM/Qwen2.5-VL
- vLLM Support: Yes (deployed with 0.15.0)
- SGLang Support: Yes