How to Deploy InternVL3-2B with vLLM
Younes El Hjouji
Detailed Deployment Instructions
1. Environment Setup
# Create a working directory (choose your preferred location)
# Create virtual environment with Python 3.12
# Make sure you have Python 3.12 installed: sudo apt install python3.12 python3.12-venv
python3.12 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Verify Python version (should be 3.12.x)
python --version
# Upgrade pip
pip install --upgrade pip
# Install vLLM (this takes ~5-10 minutes)
pip install vllm
2. Launch Deployment
# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate
# Launch vLLM serve (full native context)
vllm serve OpenGVLab/InternVL3-2B \
--trust-remote-code \
--dtype bfloat16 \
--max-model-len 16384 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.9 \
--port 8000
3. Verification
3.1. Check server is running:
curl http://localhost:8000/v1/models
Expected output:
{
"object": "list",
"data": [{
"id": "OpenGVLab/InternVL3-2B",
"object": "model",
"created": 1770107841,
"owned_by": "vllm",
"root": "OpenGVLab/InternVL3-2B",
"parent": null,
"max_model_len": 16384
}]
}
3.2. Test text inference:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "OpenGVLab/InternVL3-2B",
"prompt": "Hello, how are you?",
"max_tokens": 50
}'
3.3. Test video inference:
# Encode video to base64
VIDEO_BASE64=$(base64 -w 0 /path/to/your/video.mp4)
# Time the request and measure latency
time curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"OpenGVLab/InternVL3-2B\",
\"messages\": [{
\"role\": \"user\",
\"content\": [
{
\"type\": \"video_url\",
\"video_url\": {
\"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"
}
},
{
\"type\": \"text\",
\"text\": \"Describe the user actions in this video in a sequential list.\"}
]
}],
\"max_tokens\": 1024
}"
Example output:
Certainly. In the video, a man wearing a red jersey is standing on a soccer field.
He appears to be clapping his hands or gesturing with them while speaking, expressing his comments or opinions...
Video inference latency: 0.87s (real time) - Extremely fast!
4. Troubleshooting
Issue 1: Trust remote code required
- Symptom: "ValueError: trust_remote_code is required for this model"
- Solution: Add
--trust-remote-codeflag (already included in deployment command)
Issue 2: Out of memory
- Symptom: CUDA out of memory
- Solution: This model is small (2B params), only needs ~10GB. If OOM occurs, stop other models or reduce
--gpu-memory-utilizationto 0.5
Issue 3: Slow compilation on first run
- Symptom: Server takes 2 minutes to fully start
- Solution: This is normal. Torch compilation caches graphs for future use. Subsequent startups will be faster.
5. Notes
- Context length: Model supports 16,384 tokens (session length)
- Multimodal support: Supports both image and video inputs
- Vision backbone: InternViT-300M-448px-V2.5
- LLM backbone: Qwen2.5-1.5B
- Flash Attention: Automatically enabled for vision encoder (AttentionBackendEnum.FLASH_ATTN)
- Torch Compilation: Enabled for backbone with Inductor backend for better performance
- Variable Visual Position Encoding (V2PE): Improves long-context understanding
- Extremely fast: 0.87s video inference latency is exceptional for a 2B model
6. Model Specifications
- Model size: 2B parameters
- Architecture: InternVL3 (InternViT-300M-448px-V2.5 + Qwen2.5-1.5B)
- Context window: 16,384 tokens
- License: Apache-2.0
- Quantization support: 8-bit quantization supported via
load_in_8bit=True - Primary deployment: LMDeploy (lmdeploy>=0.7.3), but vLLM works great
7. References
- Model Card: https://huggingface.co/OpenGVLab/InternVL3-2B
- Paper: https://arxiv.org/abs/2504.10479
- GitHub: https://github.com/OpenGVLab/InternVL
- Minimum vLLM version: Not specified (deployed with 0.15.0)