How to Deploy InternVL3.5-4B with vLLM
Younes El Hjouji
Detailed Deployment Instructions
1. Environment Setup
# Create a working directory and change into it (replace the placeholder path with your preferred location)
mkdir -p /path/to/workdir && cd /path/to/workdir
# Create virtual environment with Python 3.12
# Make sure you have Python 3.12 installed: sudo apt install python3.12 python3.12-venv
python3.12 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Verify Python version (should be 3.12.x)
python --version
# Upgrade pip
pip install --upgrade pip
# Install vLLM with video support (this takes ~5-10 minutes)
pip install vllm
pip install 'vllm[video]'
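Before launching, it can help to confirm that the active interpreter really is the 3.12 one from the virtual environment. The following is a minimal sketch; the helper name is_python_312 is hypothetical and is not part of Python or vLLM.

```shell
# Hypothetical helper: succeeds only if the given version string is a Python 3.12 release.
is_python_312() {
  case "$1" in
    "Python 3.12."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Check the interpreter that is first on PATH (the venv one, if it is activated).
if is_python_312 "$(python --version 2>&1)"; then
  echo "OK: Python 3.12 is active"
else
  echo "Warning: active interpreter is not Python 3.12; re-run 'source venv/bin/activate'" >&2
fi
```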
2. Launch Deployment
# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate
# Launch vLLM serve (full native context)
vllm serve OpenGVLab/InternVL3_5-4B \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.9 \
  --port 8000
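The server does not accept requests until the model has finished loading, so scripts that launch it in the background may want to wait first. The sketch below polls an endpoint until it answers successfully or a timeout expires. The helper name wait_for_server, the 2-second polling interval, and the default 300-second timeout are assumptions chosen for illustration.

```shell
# Hypothetical helper: poll a URL until it returns HTTP success or the timeout (in seconds) expires.
# Usage: wait_for_server URL [TIMEOUT_SECONDS]
wait_for_server() {
  wfs_url="$1"
  wfs_timeout="${2:-300}"
  wfs_elapsed=0
  while [ "$wfs_elapsed" -lt "$wfs_timeout" ]; do
    # -s: silent, -f: treat HTTP errors as failures, -o /dev/null: discard the body
    if curl -sf -o /dev/null --max-time 5 "$wfs_url"; then
      return 0
    fi
    sleep 2
    wfs_elapsed=$((wfs_elapsed + 2))
  done
  return 1
}

# Example (run from a second terminal after starting the server):
# wait_for_server http://localhost:8000/v1/models 600 && echo "server is ready"
```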
3. Verification
3.1. Check server is running:
curl http://localhost:8000/v1/models
Expected output:
{
  "object": "list",
  "data": [{
    "id": "OpenGVLab/InternVL3_5-4B",
    "object": "model",
    "created": 1769822565,
    "owned_by": "vllm",
    "root": "OpenGVLab/InternVL3_5-4B",
    "parent": null,
    "max_model_len": 32768
  }]
}
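To check the response in a script rather than by eye, one option is to look for the model id in the returned JSON. This is a rough sketch that uses only tr and grep instead of a JSON parser; the helper name has_model is hypothetical.

```shell
# Hypothetical helper: succeeds if the /v1/models JSON (first argument)
# lists the given model id (second argument).
has_model() {
  # Strip whitespace so pretty-printed and compact JSON match the same way,
  # then search for the exact "id":"<model>" pair as a fixed string.
  printf '%s' "$1" | tr -d ' \t\n' | grep -Fq "\"id\":\"$2\""
}

# Example:
# has_model "$(curl -s http://localhost:8000/v1/models)" "OpenGVLab/InternVL3_5-4B" \
#   && echo "model is loaded"
```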
3.2. Test text inference:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL3_5-4B",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'
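The request above goes through /v1/completions, while the video test uses /v1/chat/completions. A text-only request against the chat endpoint can be a useful intermediate check. The sketch below builds such a payload; the helper names json_escape and build_chat_payload are hypothetical, and the escaping covers only backslashes and double quotes (not newlines or control characters).

```shell
# Hypothetical helper: escape backslashes and double quotes for use inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: print a chat-completions payload.
# Usage: build_chat_payload MODEL PROMPT MAX_TOKENS
build_chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": %d}\n' \
    "$1" "$(json_escape "$2")" "$3"
}

# Example (sends the payload to the running server; -d @- reads the body from stdin):
# build_chat_payload "OpenGVLab/InternVL3_5-4B" "Hello, how are you?" 50 \
#   | curl -s http://localhost:8000/v1/chat/completions \
#       -H "Content-Type: application/json" -d @-
```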
3.3. Test video inference:
# Encode video to base64
VIDEO_BASE64=$(base64 -w 0 /path/to/your/video.mp4)
# Create request file
cat > /tmp/video_request.json << EOF
{
  "model": "OpenGVLab/InternVL3_5-4B",
  "messages": [{
    "role": "user",
    "content": [
      {
        "type": "video_url",
        "video_url": {
          "url": "data:video/mp4;base64,$VIDEO_BASE64"
        }
      },
      {
        "type": "text",
        "text": "Describe the user actions in this video in a sequential list."
      }
    ]
  }],
  "max_tokens": 1024
}
EOF
# Time the request
time curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @/tmp/video_request.json
Example output:
A soccer player is seen shaking his hands and team mates can be seen sitting behind him.
Video inference latency: 0.631s (real time, as reported by the time command).
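The -w 0 option used in the encoding step is a GNU coreutils feature, and not every base64 implementation accepts it. A more portable sketch is to read the file from stdin and strip newlines with tr. The helper name video_data_url is hypothetical, and the MIME type is fixed to video/mp4 to match the request above.

```shell
# Hypothetical helper: print a data URL for an MP4 file, without relying on `base64 -w 0`.
video_data_url() {
  [ -f "$1" ] || { echo "file not found: $1" >&2; return 1; }
  # Reading from stdin and deleting newlines yields a single-line base64 string.
  printf 'data:video/mp4;base64,%s' "$(base64 < "$1" | tr -d '\n')"
}

# Example:
# VIDEO_URL=$(video_data_url /path/to/your/video.mp4)
```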
4. Troubleshooting
Issue 1: Trust remote code error
- Symptom: "Please pass the argument trust_remote_code=True"
- Solution: Add the --trust-remote-code flag (REQUIRED for InternVL models)
Issue 2: ModuleNotFoundError for video processing
- Symptom: "ModuleNotFoundError: No module named 'opencv'"
- Solution: Install video support with pip install 'vllm[video]'
Issue 3: Tokenizer warnings about truncation
- Symptom: "The following intended overrides are not keyword args and will be dropped: {'truncation'}"
- Solution: This is a warning, not an error. Model works correctly despite the warning.
5. Notes
- Context length: Model natively supports 32K tokens. This single-model deployment uses the full 32K context (--max-model-len 32768).
- Multimodal support: Supports both image and video inputs
- Vision backbone: InternViT-300M-448px-V2.5
- LLM backbone: Qwen3-4B
- Architecture: InternVL3.5 (InternViT vision encoder + MLP projector + Qwen3 language model)
- Flash Attention: Automatically enabled for vision encoder
- Torch Compilation: Enabled for backbone with Inductor backend
- Trust remote code: REQUIRED - model uses custom code for multimodal processing
- Performance: In this test, video inference took 0.631s, compared with 6.5s measured for a Qwen 4B model
6. Model Specifications
- Model size: 5B parameters (officially listed as 4B class)
- Architecture: InternVL3.5 (vision encoder + Qwen3-4B language model)
- Context: 32,768 tokens (native SFT length)
- License: Apache-2.0
- Date published: 2025-08-26
- Video capability: Full video understanding support
- Special features: Part of InternVL3.5 family with enhanced vision capabilities
7. Performance Benchmarks
Text inference:
- Latency: < 1 second for 50 tokens
- Quality: Inherited from the Qwen3 language backbone
Video inference:
- Latency: 0.631s for an 84KB video (video.mp4), roughly 10x faster than the 6.5s measured for Qwen 4B
- Quality: Accurate scene description with efficient encoding
- Memory efficient: Optimized video token encoding
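The "10x" figure follows directly from the two latencies listed above. As a small worked example, the ratio can be computed with awk; the helper name speedup is hypothetical.

```shell
# Hypothetical helper: print baseline/candidate latency ratio, rounded to one decimal place.
# Usage: speedup BASELINE_SECONDS CANDIDATE_SECONDS
speedup() {
  awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.1f\n", base / cand }'
}

speedup 6.5 0.631   # prints 10.3
```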
8. References
- Model Card: https://huggingface.co/OpenGVLab/InternVL3_5-4B
- Paper: https://huggingface.co/papers/2508.18265
- GitHub: https://github.com/OpenGVLab/InternVL
- vLLM Documentation: https://docs.vllm.ai/