How to Deploy Qwen2.5-VL-72B with vLLM
Younes El Hjouji
Detailed Deployment Instructions
1. Environment Setup
# Create a working directory (choose your preferred location)
# Create virtual environment with Python 3.12
python3.12 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Verify Python version (should be 3.12.x)
python --version
# Upgrade pip
pip install --upgrade pip
# Install vLLM
pip install vllm
2. Launch Deployment
# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate
# Use the FP8 quantized variant for single-H200 deployment
vllm serve RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic \
--dtype bfloat16 \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.90 \
--port 8000
3. Verification
3.1. Check server is running:
curl http://localhost:8000/v1/models
3.2. Test text inference:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic",
"prompt": "Hello, how are you?",
"max_tokens": 50
}'
3.3. Test video inference:
VIDEO_BASE64=$(base64 -w 0 /path/to/video.mp4)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic\",
\"messages\": [{
\"role\": \"user\",
\"content\": [
{
\"type\": \"video_url\",
\"video_url\": {
\"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"
}
},
{
\"type\": \"text\",
\"text\": \"Describe this video in detail.\"
}
]
}],
\"max_tokens\": 256
}"
Verified Results:
- Deployment succeeded on a single H200 using the FP8 quantized variant.
- The FP8 model used about 71.5 GB for weights, which left practical headroom for KV cache.
- Text and video inference both worked through the OpenAI-compatible vLLM endpoint.
4. Troubleshooting
Issue 1: Out of memory with the original BF16 model
- Solution: For single-H200 deployment, use
RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamicinstead of the original BF16 weights.
Issue 2: Remote video URLs fail
- Solution: Use a local video file encoded to base64 instead of relying on remote video URLs.
Issue 3: Slow first load
- Solution: This is expected for a model of this size. The initial download and load can take several minutes.
5. Notes
- For a single H200, the FP8 quantized variant is the practical deployment path.
- In our deployment, FP8 reduced the weight footprint enough to make the model usable with room for KV cache.
- Native context is 32K, but 16K was the clean working deployment setting here.
- The model supports text, image, and video inputs through vLLM.
6. References
- Original Model: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct
- FP8 Quantized Model: https://huggingface.co/RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic
- Alternative FP8 Model: https://huggingface.co/parasail-ai/Qwen2.5-VL-72B-Instruct-FP8-Dynamic
- Paper: https://huggingface.co/papers/2502.13923
- GitHub: https://github.com/QwenLM/Qwen2.5-VL