March 4, 2026

How to Deploy MiniCPM-V-4.5 with vLLM

Younes El Hjouji

Detailed Deployment Instructions

1. Environment Setup

# Create a working directory (choose your preferred location)

# Create virtual environment with Python 3.12
python3.12 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Verify Python version (should be 3.12.x)
python --version

# Upgrade pip
pip install --upgrade pip

# Install vLLM with video support
pip install vllm
pip install 'vllm[video]'

2. Launch Deployment

# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate

vllm serve openbmb/MiniCPM-V-4_5 \
  --trust-remote-code \
  --dtype auto \
  --max-model-len 40960 \
  --max-num-batched-tokens 40960 \
  --gpu-memory-utilization 0.9 \
  --port 8000

3. Verification

3.1. Check server is running:

curl http://localhost:8000/v1/models

3.2. Test text inference:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openbmb/MiniCPM-V-4_5",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'

3.3. Test video inference:

VIDEO_BASE64=$(base64 -w 0 /path/to/video.mp4)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"openbmb/MiniCPM-V-4_5\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {
          \"type\": \"video_url\",
          \"video_url\": {
            \"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"
          }
        },
        {
          \"type\": \"text\",
          \"text\": \"Describe the user actions in this video in a sequential list.\"
        }
      ]
    }],
    \"max_tokens\": 1024
  }"

Verified Results:

Deployment succeeded with full native context.
Text inference returned quickly for short prompts.
A short video inference request completed in about 8.6 seconds and produced the correct action summary.

4. Troubleshooting

Issue 1: Missing video dependencies

Solution: Install vllm[video].

Issue 2: trust_remote_code required

Solution: Include --trust-remote-code in the launch command.

Issue 3: Out of memory

Solution: Reduce --gpu-memory-utilization or lower --max-model-len.

Issue 4: Slow first load

Solution: This is expected on the first run while the model is downloaded and compiled.

5. Notes

MiniCPM-V-4.5 supports text, image, and video inputs through vLLM.
The model uses trust_remote_code, so keep that flag in place.
The tested deployment used the full native 40,960-token context.

6. References

Model Card: https://huggingface.co/openbmb/MiniCPM-V-4_5
Paper: https://arxiv.org/abs/2509.18154
GitHub: https://github.com/OpenBMB/MiniCPM-o
vLLM Documentation: https://docs.vllm.ai/