March 4, 2026

How to Deploy Qwen2.5-VL-72B with vLLM

Younes El Hjouji

Detailed Deployment Instructions

1. Environment Setup

# Create a working directory (choose your preferred location)

# Create virtual environment with Python 3.12
python3.12 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Verify Python version (should be 3.12.x)
python --version

# Upgrade pip
pip install --upgrade pip

# Install vLLM
pip install vllm

2. Launch Deployment

# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate

# Use the FP8 quantized variant for single-H200 deployment
vllm serve RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic \
  --dtype bfloat16 \
  --max-model-len 16384 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.90 \
  --port 8000

3. Verification

3.1. Check server is running:

curl http://localhost:8000/v1/models

3.2. Test text inference:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'

3.3. Test video inference:

VIDEO_BASE64=$(base64 -w 0 /path/to/video.mp4)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {
          \"type\": \"video_url\",
          \"video_url\": {
            \"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"
          }
        },
        {
          \"type\": \"text\",
          \"text\": \"Describe this video in detail.\"
        }
      ]
    }],
    \"max_tokens\": 256
  }"

Verified Results:

Deployment succeeded on a single H200 using the FP8 quantized variant.
The FP8 model used about 71.5 GB for weights, which left practical headroom for KV cache.
Text and video inference both worked through the OpenAI-compatible vLLM endpoint.

4. Troubleshooting

Issue 1: Out of memory with the original BF16 model

Solution: For single-H200 deployment, use RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic instead of the original BF16 weights.

Issue 2: Remote video URLs fail

Solution: Use a local video file encoded to base64 instead of relying on remote video URLs.

Issue 3: Slow first load

Solution: This is expected for a model of this size. The initial download and load can take several minutes.

5. Notes

For a single H200, the FP8 quantized variant is the practical deployment path.
In our deployment, FP8 reduced the weight footprint enough to make the model usable with room for KV cache.
Native context is 32K, but 16K was the clean working deployment setting here.
The model supports text, image, and video inputs through vLLM.

6. References

Original Model: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct
FP8 Quantized Model: https://huggingface.co/RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic
Alternative FP8 Model: https://huggingface.co/parasail-ai/Qwen2.5-VL-72B-Instruct-FP8-Dynamic
Paper: https://huggingface.co/papers/2502.13923
GitHub: https://github.com/QwenLM/Qwen2.5-VL