March 4, 2026

How to Deploy InternVL3.5-4B with vLLM

Younes El Hjouji

Detailed Deployment Instructions

1. Environment Setup

# Create a working directory in your preferred location and cd into it

# Create virtual environment with Python 3.12
# Make sure you have Python 3.12 installed: sudo apt install python3.12 python3.12-venv
python3.12 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Verify Python version (should be 3.12.x)
python --version

# Upgrade pip
pip install --upgrade pip

# Install vLLM with video support (this takes ~5-10 minutes)
pip install vllm
pip install 'vllm[video]'
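
A quick import check after installation can catch a broken environment before the first launch. The sketch below defines a hypothetical helper, check_python_modules, which is not part of vLLM; it only reports whether each named module imports in the active interpreter. Checking cv2 in addition to vllm assumes that the video extra installs OpenCV, which may differ between vLLM versions.

```shell
# Hypothetical helper: report which Python modules import cleanly.
# Returns 0 if every module imports, 1 if at least one is missing.
check_python_modules() {
  local mod
  local missing=0
  for mod in "$@"; do
    if python3 -c "import ${mod}" >/dev/null 2>&1; then
      echo "ok: ${mod}"
    else
      echo "MISSING: ${mod}"
      missing=1
    fi
  done
  return "${missing}"
}

# Usage (inside the activated venv; cv2 is an assumed dependency name):
#   check_python_modules vllm cv2
```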

2. Launch Deployment

# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate

# Launch vLLM serve (full native context)
vllm serve OpenGVLab/InternVL3_5-4B \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.9 \
  --port 8000
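
The server needs some time to load the weights before it accepts requests, so the verification calls in the next section can fail if they are sent too early. As a minimal sketch, the hypothetical wait_for_server helper below polls the /v1/models endpoint until it answers or a timeout expires. The helper name, the 600-second default timeout and the 5-second polling interval are illustrative assumptions, not vLLM settings.

```shell
# Hypothetical helper: poll an HTTP endpoint until it responds successfully.
#   $1 = URL to poll (default: the local /v1/models endpoint)
#   $2 = timeout in seconds (default 600, an arbitrary choice)
#   $3 = polling interval in seconds (default 5)
wait_for_server() {
  local url="${1:-http://localhost:8000/v1/models}"
  local timeout="${2:-600}"
  local interval="${3:-5}"
  local waited=0
  # -f makes curl fail on HTTP errors; --max-time bounds each attempt
  until curl -sf --max-time 5 -o /dev/null "${url}"; do
    if [ "${waited}" -ge "${timeout}" ]; then
      echo "server not ready after ${timeout}s" >&2
      return 1
    fi
    sleep "${interval}"
    waited=$((waited + interval))
  done
  echo "server ready after ~${waited}s"
}

# Usage (in a second terminal, after starting vllm serve):
#   wait_for_server && curl http://localhost:8000/v1/models
```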

3. Verification

3.1. Check server is running:

curl http://localhost:8000/v1/models

Expected output (the "created" timestamp will differ):

{
  "object": "list",
  "data": [{
    "id": "OpenGVLab/InternVL3_5-4B",
    "object": "model",
    "created": 1769822565,
    "owned_by": "vllm",
    "root": "OpenGVLab/InternVL3_5-4B",
    "parent": null,
    "max_model_len": 32768
  }]
}
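
To confirm programmatically that the served context length matches the --max-model-len value passed at launch, the response can be parsed instead of read by eye. The sketch below assumes python3 is available for JSON parsing; model_max_len is a hypothetical helper name, and it relies only on the "id" and "max_model_len" fields shown in the output above.

```shell
# Hypothetical helper: read a /v1/models response on stdin and print
# max_model_len for the model whose id equals $1. Exits non-zero if
# the model is not listed.
model_max_len() {
  python3 -c '
import json, sys

wanted = sys.argv[1]
for model in json.load(sys.stdin)["data"]:
    if model["id"] == wanted:
        print(model["max_model_len"])
        break
else:
    sys.exit(1)
' "$1"
}

# Usage:
#   curl -s http://localhost:8000/v1/models | model_max_len OpenGVLab/InternVL3_5-4B
```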

3.2. Test text inference:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL3_5-4B",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'

3.3. Test video inference:

# Encode video to base64
VIDEO_BASE64=$(base64 -w 0 /path/to/your/video.mp4)

# Create request file
cat > /tmp/video_request.json << EOF
{
  "model": "OpenGVLab/InternVL3_5-4B",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "video_url",
        "video_url": {
          "url": "data:video/mp4;base64,$VIDEO_BASE64"
        }
      },
      {
        "type": "text",
        "text": "Describe the user actions in this video in a sequential list."
      }
    ]
  }],
  "max_tokens": 1024
}
EOF

# Time the request
time curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @/tmp/video_request.json
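
Because the video travels inside the JSON body as a base64 data URL, the request is roughly one third larger than the file itself: base64 emits 4 characters for every 3 input bytes, rounded up to a full group. The sketch below works out that size for a given byte count; base64_size is a hypothetical helper name, and the figure ignores the small amount of surrounding JSON.

```shell
# Hypothetical helper: size in characters of the base64 encoding
# (with padding, without line breaks) of $1 input bytes.
base64_size() {
  echo $(( ( ($1 + 2) / 3 ) * 4 ))
}

# Usage:
#   base64_size "$(wc -c < /path/to/your/video.mp4)"
```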

Example output:

A soccer player is seen shaking his hands and team mates can be seen sitting behind him.
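
The curl commands print the full JSON response; the description above corresponds to the generated text inside it. As a sketch, assuming python3 is available, the hypothetical response_text helper below pulls out just that text. It handles both response shapes used in this guide: the "text" field of /v1/completions and the "message.content" field of /v1/chat/completions.

```shell
# Hypothetical helper: read an OpenAI-style response on stdin and
# print only the generated text of the first choice.
response_text() {
  python3 -c '
import json, sys

choice = json.load(sys.stdin)["choices"][0]
# /v1/completions uses "text"; /v1/chat/completions uses "message.content"
print(choice["text"] if "text" in choice else choice["message"]["content"])
'
}

# Usage:
#   curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d @/tmp/video_request.json | response_text
```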

Video inference latency: 0.631s wall-clock (the "real" value reported by time) for this request.

4. Troubleshooting

Issue 1: Trust remote code error

Symptom: "Please pass the argument trust_remote_code=True"
Solution: Add --trust-remote-code flag (REQUIRED for InternVL models)

Issue 2: ModuleNotFoundError for video processing

Symptom: "ModuleNotFoundError: No module named 'cv2'" (cv2 is the import name of OpenCV)
Solution: Install video support with pip install 'vllm[video]'

Issue 3: Tokenizer warnings about truncation

Symptom: "The following intended overrides are not keyword args and will be dropped: {'truncation'}"
Solution: This is a warning, not an error. The model works correctly despite the warning.
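
The three symptoms above can be checked mechanically if the server output has been saved to a file. The sketch below assumes such a log exists (for example, captured with tee when launching vllm serve); diagnose_vllm_log is a hypothetical helper, and the search strings are simply fragments of the messages quoted in this section, so they may not match every vLLM version.

```shell
# Hypothetical helper: scan a saved server log ($1) for the symptoms
# listed in this section and print the matching solution.
diagnose_vllm_log() {
  local log="$1"
  local found=0
  if grep -q 'trust_remote_code=True' "${log}"; then
    echo "trust-remote-code: add --trust-remote-code to vllm serve"
    found=1
  fi
  if grep -q 'ModuleNotFoundError' "${log}"; then
    echo "missing-module: run pip install 'vllm[video]'"
    found=1
  fi
  if grep -q 'intended overrides are not keyword args' "${log}"; then
    echo "truncation-warning: harmless, no action needed"
    found=1
  fi
  [ "${found}" -eq 1 ] || echo "no known issue found"
}

# Usage (vllm.log is an assumed file name):
#   diagnose_vllm_log vllm.log
```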

5. Notes

Context length: Model natively supports 32K tokens. Solo deployment uses full 32K context.
Multimodal support: Supports both image and video inputs
Vision backbone: InternViT-300M-448px-V2.5
LLM backbone: Qwen3-4B
Architecture: InternVL3.5 (InternViT vision encoder + Qwen3 language model)
Flash Attention: Automatically enabled for vision encoder
Torch Compilation: Enabled for backbone with Inductor backend
Trust remote code: REQUIRED - model uses custom code for multimodal processing
Performance: Fastest of the 4B video models measured for this guide (0.631s vs 6.5s for Qwen 4B)

6. Model Specifications

Model size: about 4.7B parameters (0.3B vision encoder + 4.4B language model; officially listed as 4B class)
Architecture: InternVL3.5 (vision encoder + Qwen3-4B language model)
Context: 32,768 tokens (native SFT length)
License: Apache-2.0
Date published: 2025-08-26
Video capability: Full video understanding support
Special features: Part of InternVL3.5 family with enhanced vision capabilities

7. Performance Benchmarks

Text inference:

Latency: < 1 second for 50 tokens
Quality: High quality from Qwen3 backbone

Video inference:

Latency: 0.631s for an 84KB video (video.mp4), about 10x faster than Qwen 4B (6.5s)
Quality: Accurate scene description with efficient encoding
Memory efficient: Optimized video token encoding
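
A single timed request is a noisy measurement, so the latency figures above are easier to trust when repeated. As a rough sketch, the helpers below time several requests with curl's time_total write-out variable and summarize them with awk. Both helper names are hypothetical, the request file is the one created in section 3.3, and repeated identical requests may be served faster than the first one if the server caches earlier work, so the first run is worth looking at separately.

```shell
# Hypothetical helper: send one chat request from a JSON file ($1)
# and print only the total request time in seconds.
time_request() {
  curl -s -o /dev/null -w '%{time_total}\n' \
    http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @"$1"
}

# Hypothetical helper: read one time per line on stdin and print
# the count, mean, minimum and maximum.
summarize_times() {
  awk '
    { sum += $1
      if (NR == 1 || $1 < min) min = $1
      if (NR == 1 || $1 > max) max = $1 }
    END {
      if (NR == 0) exit 1
      printf "n=%d mean=%.3f min=%.3f max=%.3f\n", NR, sum / NR, min, max
    }'
}

# Usage:
#   for i in 1 2 3 4 5; do time_request /tmp/video_request.json; done | summarize_times
```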

8. References

Model Card: https://huggingface.co/OpenGVLab/InternVL3_5-4B
Paper: https://huggingface.co/papers/2508.18265
GitHub: https://github.com/OpenGVLab/InternVL
vLLM Documentation: https://docs.vllm.ai/