← Back to all posts

How to Deploy Qwen2.5-VL-72B with vLLM

Younes El Hjouji

Detailed Deployment Instructions

1. Environment Setup

# Create a working directory (choose your preferred location)

# Create virtual environment with Python 3.12
python3.12 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Verify Python version (should be 3.12.x)
python --version

# Upgrade pip
pip install --upgrade pip

# Install vLLM
pip install vllm

2. Launch Deployment

# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate

# Use the FP8 quantized variant for single-H200 deployment
vllm serve RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic \
  --dtype bfloat16 \
  --max-model-len 16384 \
  --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.90 \
  --port 8000

3. Verification

3.1. Check server is running:

curl http://localhost:8000/v1/models

3.2. Test text inference:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'

3.3. Test video inference:

VIDEO_BASE64=$(base64 -w 0 /path/to/video.mp4)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {
          \"type\": \"video_url\",
          \"video_url\": {
            \"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"
          }
        },
        {
          \"type\": \"text\",
          \"text\": \"Describe this video in detail.\"
        }
      ]
    }],
    \"max_tokens\": 256
  }"

Verified Results:

  • Deployment succeeded on a single H200 using the FP8 quantized variant.
  • The FP8 model used about 71.5 GB for weights, which left practical headroom for KV cache.
  • Text and video inference both worked through the OpenAI-compatible vLLM endpoint.

4. Troubleshooting

Issue 1: Out of memory with the original BF16 model

  • Solution: For single-H200 deployment, use RedHatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic instead of the original BF16 weights.

Issue 2: Remote video URLs fail

  • Solution: Use a local video file encoded to base64 instead of relying on remote video URLs.

Issue 3: Slow first load

  • Solution: This is expected for a model of this size. The initial download and load can take several minutes.

5. Notes

  • For a single H200, the FP8 quantized variant is the practical deployment path.
  • In our deployment, FP8 reduced the weight footprint enough to make the model usable with room for KV cache.
  • Native context is 32K, but 16K was the clean working deployment setting here.
  • The model supports text, image, and video inputs through vLLM.

6. References