How to Deploy InternVL3-8B with vLLM
Younes El Hjouji
Quick Deployment
# Setup
python3.12 -m venv venv && source venv/bin/activate
pip install --upgrade pip && pip install vllm
# Launch
vllm serve OpenGVLab/InternVL3-8B \
--trust-remote-code \
--dtype bfloat16 \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.9 \
--port 8000
# Test
curl http://localhost:8000/v1/models
Video Test:
VIDEO_BASE64=$(base64 -w 0 /path/to/video.mp4)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"OpenGVLab/InternVL3-8B\",
\"messages\": [{
\"role\": \"user\",
\"content\": [
{\"type\": \"video_url\", \"video_url\": {\"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"}},
{\"type\": \"text\", \"text\": \"Describe this video.\"}
]
}],
\"max_tokens\": 1024
}"
Specifications
- Parameters: 8B
- Architecture: InternVL3 (InternViT-300M + Qwen2.5-7B)
- Context: 32,768 tokens
- License: Apache-2.0
- Backbones: InternViT-300M-448px-V2.5, Qwen2.5-7B
References
- Model Card: https://huggingface.co/OpenGVLab/InternVL3-8B
- Paper: https://arxiv.org/abs/2504.10479
- GitHub: https://github.com/OpenGVLab/InternVL