← Back to all posts

How to Deploy InternVL3.5-8B with vLLM

Younes El Hjouji

Detailed Deployment Instructions

1. Environment Setup

# Create a working directory (choose your preferred location)

# Create virtual environment with Python 3.12
# Make sure you have Python 3.12 installed: sudo apt install python3.12 python3.12-venv
python3.12 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Verify Python version (should be 3.12.x)
python --version

# Upgrade pip
pip install --upgrade pip

# Install vLLM (this takes ~5-10 minutes)
pip install vllm

2. Launch Deployment

# Make sure you're in your working directory and activate virtual environment
source venv/bin/activate

# Launch vLLM serve (full native context)
vllm serve OpenGVLab/InternVL3_5-8B \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --max-num-batched-tokens 32768 \
  --gpu-memory-utilization 0.9 \
  --port 8000

3. Verification

3.1. Check server is running:

curl http://localhost:8000/v1/models

Expected output:

{
  "object": "list",
  "data": [{
    "id": "OpenGVLab/InternVL3_5-8B",
    "object": "model",
    "created": 1770105248,
    "owned_by": "vllm",
    "root": "OpenGVLab/InternVL3_5-8B",
    "parent": null,
    "max_model_len": 32768
  }]
}

3.2. Test text inference:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL3_5-8B",
    "prompt": "Hello, how are you?",
    "max_tokens": 50
  }'

Expected output:

{
  "choices": [{
    "text": " I'm good, I just wanted to do a little experiment..."
  }]
}

3.3. Test video inference:

# Encode video to base64
VIDEO_BASE64=$(base64 -w 0 /path/to/your/video.mp4)

# Time the request and measure latency
time curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"OpenGVLab/InternVL3_5-8B\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {
          \"type\": \"video_url\",
          \"video_url\": {
            \"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"
          }
        },
        {
          \"type\": \"text\",
          \"text\": \"Describe the user actions in this video in a sequential list.\"
        }
      ]
    }],
    \"max_tokens\": 1024
  }"

Example output:

Certainly! Here is a detailed description of the user actions in a sequential list based on the provided video frames:

1. **Initial Clapping**:
    - The individual starts clapping their hands together.

2. **Separating Hands**:
    - The person separates their hands slightly while continuing to clap.

3. **Double Clapping Motion**:
    - The clapping motion continues in a rhythmic double clap (clapping each hand separately).

4. **Movement to the Right**:
    - The individual begins to move to the right side of the frame while continuing to clap.
...

Video inference latency: 2.15s (real time)

4. Troubleshooting

Issue 1: Trust remote code required

  • Symptom: "ValueError: trust_remote_code is required for this model"
  • Solution: Add --trust-remote-code flag (already included in deployment command)

Issue 2: Out of memory with full context

  • Symptom: CUDA out of memory when using 32768 context
  • Solution: Reduce --max-model-len to 16384 and --gpu-memory-utilization to 0.5 for lower memory usage

Issue 3: Slow compilation on first run

  • Symptom: Server takes 2-3 minutes to fully start
  • Solution: This is normal. Torch compilation caches graphs for future use. Subsequent startups will be faster.

5. Notes

  • Context length: Model natively supports 32K tokens (from SFT stage). Solo deployment uses full 32K context.
  • Multimodal support: Supports both image and video inputs
  • Vision backbone: InternViT-300M-448px
  • LLM backbone: Qwen3-8B
  • Flash Attention: Automatically enabled for vision encoder (AttentionBackendEnum.FLASH_ATTN)
  • Torch Compilation: Enabled for backbone with Inductor backend for better performance
  • Chunked prefill: Enabled with max_num_batched_tokens=32768
  • Asynchronous scheduling: Enabled for better throughput
  • Prefix caching: Enabled by default
  • Thinking mode: Can be enabled by setting R1_SYSTEM_PROMPT (see model card for details)

6. Model Specifications

  • Model size: 9B parameters (9.4B total: 0.3B vision + 8.2B LLM)
  • Architecture: InternVL3.5 (InternViT-300M + Qwen3-8B)
  • Context window: 32,768 tokens
  • Vision tokens: 256 tokens per image patch (compressed from 1024)
  • License: Apache-2.0
  • Quantization support: BNB 8-bit supported

7. References