How to Deploy InternVL3.5-8B with vLLM
Younes El Hjouji
Detailed Deployment Instructions
1. Environment Setup
# Create a working directory (choose your preferred location)
# Create virtual environment with Python 3.12
# Make sure you have Python 3.12 installed: sudo apt install python3.12 python3.12-venv
python3.12 -m venv venv
# Activate virtual environment
source venv/bin/activate
# Verify Python version (should be 3.12.x)
python --version
# Upgrade pip
pip install --upgrade pip
# Install vLLM (this takes ~5-10 minutes)
pip install vllm
2. Launch Deployment
# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate
# Launch vLLM serve (full native context)
vllm serve OpenGVLab/InternVL3_5-8B \
--trust-remote-code \
--dtype bfloat16 \
--max-model-len 32768 \
--max-num-batched-tokens 32768 \
--gpu-memory-utilization 0.9 \
--port 8000
3. Verification
3.1. Check server is running:
curl http://localhost:8000/v1/models
Expected output:
{
"object": "list",
"data": [{
"id": "OpenGVLab/InternVL3_5-8B",
"object": "model",
"created": 1770105248,
"owned_by": "vllm",
"root": "OpenGVLab/InternVL3_5-8B",
"parent": null,
"max_model_len": 32768
}]
}
3.2. Test text inference:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "OpenGVLab/InternVL3_5-8B",
"prompt": "Hello, how are you?",
"max_tokens": 50
}'
Expected output:
{
"choices": [{
"text": " I'm good, I just wanted to do a little experiment..."
}]
}
3.3. Test video inference:
# Encode video to base64
VIDEO_BASE64=$(base64 -w 0 /path/to/your/video.mp4)
# Time the request and measure latency
time curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"OpenGVLab/InternVL3_5-8B\",
\"messages\": [{
\"role\": \"user\",
\"content\": [
{
\"type\": \"video_url\",
\"video_url\": {
\"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"
}
},
{
\"type\": \"text\",
\"text\": \"Describe the user actions in this video in a sequential list.\"
}
]
}],
\"max_tokens\": 1024
}"
Example output:
Certainly! Here is a detailed description of the user actions in a sequential list based on the provided video frames:
1. **Initial Clapping**:
- The individual starts clapping their hands together.
2. **Separating Hands**:
- The person separates their hands slightly while continuing to clap.
3. **Double Clapping Motion**:
- The clapping motion continues in a rhythmic double clap (clapping each hand separately).
4. **Movement to the Right**:
- The individual begins to move to the right side of the frame while continuing to clap.
...
Video inference latency: 2.15s (real time)
4. Troubleshooting
Issue 1: Trust remote code required
- Symptom: "ValueError: trust_remote_code is required for this model"
- Solution: Add
--trust-remote-codeflag (already included in deployment command)
Issue 2: Out of memory with full context
- Symptom: CUDA out of memory when using 32768 context
- Solution: Reduce
--max-model-lento 16384 and--gpu-memory-utilizationto 0.5 for lower memory usage
Issue 3: Slow compilation on first run
- Symptom: Server takes 2-3 minutes to fully start
- Solution: This is normal. Torch compilation caches graphs for future use. Subsequent startups will be faster.
5. Notes
- Context length: Model natively supports 32K tokens (from SFT stage). Solo deployment uses full 32K context.
- Multimodal support: Supports both image and video inputs
- Vision backbone: InternViT-300M-448px
- LLM backbone: Qwen3-8B
- Flash Attention: Automatically enabled for vision encoder (AttentionBackendEnum.FLASH_ATTN)
- Torch Compilation: Enabled for backbone with Inductor backend for better performance
- Chunked prefill: Enabled with max_num_batched_tokens=32768
- Asynchronous scheduling: Enabled for better throughput
- Prefix caching: Enabled by default
- Thinking mode: Can be enabled by setting R1_SYSTEM_PROMPT (see model card for details)
6. Model Specifications
- Model size: 9B parameters (9.4B total: 0.3B vision + 8.2B LLM)
- Architecture: InternVL3.5 (InternViT-300M + Qwen3-8B)
- Context window: 32,768 tokens
- Vision tokens: 256 tokens per image patch (compressed from 1024)
- License: Apache-2.0
- Quantization support: BNB 8-bit supported
7. References
- Model Card: https://huggingface.co/OpenGVLab/InternVL3_5-8B
- Paper: https://huggingface.co/papers/2508.18265
- GitHub: https://github.com/OpenGVLab/InternVL
- Minimum vLLM version: Not specified (deployed with 0.15.0)