How to Deploy Qwen3-VL-8B with vLLM

Younes El Hjouji

Detailed Deployment Instructions

1. Environment Setup

# Create a working directory in your preferred location and cd into it before continuing

# Create virtual environment with Python 3.12 (3.14 not supported by numba)
# Make sure you have Python 3.12 installed: sudo apt install python3.12 python3.12-venv
python3.12 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Verify Python version (should be 3.12.x)
python --version

# Upgrade pip
pip install --upgrade pip

# Install vLLM (this takes ~5-10 minutes)
pip install vllm
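Before installing, it can help to confirm that the interpreter inside the venv is one of the versions this guide treats as working. The sketch below is a minimal check, assuming only what the guide states (3.12 and 3.13 are fine, 3.14 fails because of numba); the helper name is_supported_python is hypothetical.

```python
import sys

# Assumption taken from this guide: Python 3.12 and 3.13 work, 3.14 does not (numba).
SUPPORTED_MINORS = {12, 13}


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True if the interpreter is a Python 3 minor version listed above."""
    major, minor = version_info[0], version_info[1]
    return major == 3 and minor in SUPPORTED_MINORS


# Report on the interpreter that is currently running this script.
status = "supported" if is_supported_python() else "not covered by this guide"
print(f"Python {sys.version_info[0]}.{sys.version_info[1]} is {status}")
```

Running this with the venv's python gives a quick answer before the longer vLLM install starts.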

2. Launch Deployment

# From your working directory, activate the virtual environment
source venv/bin/activate

# Launch vLLM serve (maximum context for solo deployment)
vllm serve Qwen/Qwen3-VL-8B-Instruct \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.9 \
  --port 8000
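If the launch command needs to be adjusted (for example, a smaller context on a GPU with less memory), building it from parameters keeps the flags consistent. This is a sketch only: it uses exactly the flags shown above, the function name build_serve_command is hypothetical, and keeping --max-num-batched-tokens equal to --max-model-len simply mirrors the command in this guide rather than a vLLM requirement.

```python
import shlex


def build_serve_command(model: str = "Qwen/Qwen3-VL-8B-Instruct",
                        max_model_len: int = 65536,
                        gpu_memory_utilization: float = 0.9,
                        port: int = 8000) -> list[str]:
    """Return the argv list for the `vllm serve` command used in this guide."""
    return [
        "vllm", "serve", model,
        "--dtype", "bfloat16",
        "--max-model-len", str(max_model_len),
        # Mirrors the guide: batched-token budget set equal to the context length.
        "--max-num-batched-tokens", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--port", str(port),
    ]


# Print a copy-pasteable shell command with the guide's default values.
print(shlex.join(build_serve_command()))
```

The returned list can also be passed directly to subprocess.Popen if the server should be started from a script.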

3. Verification

Check that the server is running:

curl http://localhost:8000/v1/models

Expected output:

{
  "object": "list",
  "data": [{
    "id": "Qwen/Qwen3-VL-8B-Instruct",
    "object": "model",
    "created": 1769818884,
    "owned_by": "vllm",
    "root": "Qwen/Qwen3-VL-8B-Instruct",
    "parent": null,
    "max_model_len": 65536
  }]
}
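The same check can be automated by parsing the /v1/models response and confirming that the expected model is listed with the configured context length. The sketch below works on the response body as already-decoded JSON; find_model is a hypothetical helper, and the sample is the expected output shown above.

```python
import json


def find_model(models_response: dict, model_id: str):
    """Return the entry for model_id from a /v1/models response, or None."""
    for entry in models_response.get("data", []):
        if entry.get("id") == model_id:
            return entry
    return None


# Sample body copied from the expected output in this guide.
sample = json.loads("""
{
  "object": "list",
  "data": [{
    "id": "Qwen/Qwen3-VL-8B-Instruct",
    "object": "model",
    "created": 1769818884,
    "owned_by": "vllm",
    "root": "Qwen/Qwen3-VL-8B-Instruct",
    "parent": null,
    "max_model_len": 65536
  }]
}
""")

entry = find_model(sample, "Qwen/Qwen3-VL-8B-Instruct")
print(entry["max_model_len"])
```

A max_model_len that differs from the value passed to --max-model-len is a hint that a different server instance is answering on that port.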

Test text inference:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-8B-Instruct",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'

Test with image (multimodal):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-VL-8B-Instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
      ]
    }],
    "max_tokens": 100
  }'
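The curl example above points at a public image URL. For a local file, one option is to send the image inline as a base64 data URL in the same image_url field. The following is a minimal sketch under that assumption; image_data_url and build_image_request are hypothetical helper names, and the MIME type is guessed from the file extension.

```python
import base64
import mimetypes


def image_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_image_request(image_url: str,
                        question: str = "What is in this image?",
                        max_tokens: int = 100) -> dict:
    """Build the same chat payload as the curl example, for any image URL."""
    return {
        "model": "Qwen/Qwen3-VL-8B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "max_tokens": max_tokens,
    }
```

Passing image_data_url("photo.jpg") (an example path) to build_image_request and serializing the result with json.dumps gives a body that can be posted to /v1/chat/completions.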

Test with video (base64 encoded):

# Base64-encode the video; -w 0 disables line wrapping (GNU coreutils base64)
VIDEO_BASE64=$(base64 -w 0 /path/to/video.mp4)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"Qwen/Qwen3-VL-8B-Instruct\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {
          \"type\": \"video_url\",
          \"video_url\": {
            \"url\": \"data:video/mp4;base64,$VIDEO_BASE64\"
          }
        },
        {
          \"type\": \"text\",
          \"text\": \"Describe the user actions in this video in a sequential list.\"
        }
      ]
    }],
    \"max_tokens\": 1024
  }"
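Inlining a base64 video in a shell argument is fragile: the quoting is easy to get wrong, and larger files can exceed the operating system's argument-length limit. A possible alternative, sketched below, is to build the payload in Python and write it to a file. The helper names and the payload.json file name are hypothetical; the payload structure is the one used in the curl example above.

```python
import base64
import json


def build_video_request(video_path: str,
                        prompt: str,
                        max_tokens: int = 1024,
                        model: str = "Qwen/Qwen3-VL-8B-Instruct") -> dict:
    """Build a chat payload that embeds an MP4 file as a base64 data URL."""
    with open(video_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "video_url",
                 "video_url": {"url": f"data:video/mp4;base64,{encoded}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": max_tokens,
    }


def write_payload(payload: dict, out_path: str) -> None:
    """Serialize the payload to a JSON file that curl can read."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(payload, f)
```

After write_payload(build_video_request("/path/to/video.mp4", "Describe the user actions in this video in a sequential list."), "payload.json"), the request can be sent with curl by replacing the inline -d body with --data-binary @payload.json.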

Example output:

1. A soccer player in a red jersey is standing on a grass field.
2. He is gesturing with his hands, appearing to be communicating or arguing with someone off-camera.
3. He continues to gesture and speak, moving slightly as he does so.
4. The camera focuses on him as he keeps talking and gesturing.
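The text above is the model's message content, which sits inside the JSON response returned by /v1/chat/completions. The sketch below shows one way to pull it out and to detect an answer cut off by max_tokens, assuming the OpenAI-style response layout (choices, message, finish_reason). The sample response is illustrative and trimmed, not a captured server reply.

```python
def extract_text(chat_response: dict) -> str:
    """Return the assistant's text from a chat completions response."""
    choices = chat_response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


def was_truncated(chat_response: dict) -> bool:
    """Return True if generation stopped because max_tokens was reached."""
    choices = chat_response.get("choices") or []
    return bool(choices) and choices[0].get("finish_reason") == "length"


# Illustrative, trimmed response; real responses carry more fields (id, usage, ...).
sample = {
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": "1. A soccer player in a red jersey is standing on a grass field.",
        },
        "finish_reason": "stop",
    }]
}

print(extract_text(sample))
print(was_truncated(sample))
```

If was_truncated returns True for a video description, raising max_tokens in the request is the first thing to try.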

4. Troubleshooting

Issue 1: Python 3.14 not supported by numba

  • Solution: Use Python 3.12 or 3.13 instead. Recreate the venv with, for example, python3.12 -m venv venv

Issue 2: Slow compilation on first run

  • Symptom: The server takes 2-3 minutes to start fully on the first launch
  • Solution: This is normal. Torch compilation caches the compiled graphs, so subsequent startups are faster.
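Because the first startup can take a few minutes, scripts that send requests right after launching the server may fail with connection errors. One way to handle this is to poll the /v1/models endpoint until it answers. The sketch below assumes the port used in this guide; the function names, the 300-second limit, and the 5-second interval are illustrative choices.

```python
import http.client
import time
import urllib.request


def server_is_up(url: str = "http://localhost:8000/v1/models",
                 timeout_s: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False


def wait_until_ready(probe=server_is_up,
                     max_wait_s: float = 300.0,
                     interval_s: float = 5.0,
                     sleep=time.sleep,
                     clock=time.monotonic) -> bool:
    """Call probe() until it returns True or max_wait_s has elapsed."""
    deadline = clock() + max_wait_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_until_ready() after starting vllm serve blocks until the server responds, or returns False once the time limit has passed.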

5. Notes

  • Context length: The model natively supports 256K tokens (extendable up to 1M). This deployment caps the context at 65,536 tokens via --max-model-len 65536.
  • Multimodal support: Supports both image and video inputs
  • Flash Attention: Automatically enabled for vision encoder (AttentionBackendEnum.FLASH_ATTN)
  • Torch Compilation: Enabled for backbone with Inductor backend for better performance
  • Chunked prefill: Enabled; the launch command above sets max_num_batched_tokens=65536
  • Asynchronous scheduling: Enabled for better throughput
  • Prefix caching: Enabled by default
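The context cap has a practical consequence for clients: the prompt tokens (including tokens produced from images or video) plus the requested max_tokens need to fit within max_model_len. The arithmetic can be sketched as follows, assuming "K" means 1,024 tokens and that token counts are obtained elsewhere; fits_context is a hypothetical helper.

```python
MAX_MODEL_LEN = 65536          # value passed to --max-model-len in this guide
NATIVE_CONTEXT = 256 * 1024    # assumption: "256K" read as 256 * 1024 tokens


def fits_context(prompt_tokens: int,
                 max_tokens: int,
                 max_model_len: int = MAX_MODEL_LEN) -> bool:
    """Return True if prompt plus requested output fits the context window."""
    return prompt_tokens + max_tokens <= max_model_len


# 65536 is exactly 64 * 1024, a quarter of the assumed native window.
print(MAX_MODEL_LEN == 64 * 1024)
print(NATIVE_CONTEXT // MAX_MODEL_LEN)
print(fits_context(prompt_tokens=64000, max_tokens=1024))
```

For long videos, where the prompt can grow to tens of thousands of tokens, a check like this before sending the request helps avoid requests that the server would reject as too long.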

6. References