← Back to all posts

How to Deploy Molmo2-O-7B with vLLM

Younes El Hjouji

Quick Deployment

# Setup
python3.12 -m venv venv && source venv/bin/activate
pip install --upgrade pip && pip install vllm

# Launch (full native context with YaRN)
vllm serve allenai/Molmo2-O-7B \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --max-num-batched-tokens 65536 \
  --gpu-memory-utilization 0.9 \
  --port 8000

# Test
curl http://localhost:8000/v1/models

Video Test:

VIDEO_BASE64=$(base64 -w 0 /path/to/video.mp4)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"allenai/Molmo2-O-7B\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"video_url\", \"video_url\": {\"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"}},
        {\"type\": \"text\", \"text\": \"Describe this video.\"}
      ]
    }],
    \"max_tokens\": 100
  }"

Configuration Options

ConfigGPU %ContextMemoryStatus
Solo (Recommended)90%65536~130GB✅ Full native context
Standard65%8192~93GB✅ Tested
Conservative55%4096~79GB est.Untested

Notes

  • Based on OLMo-3-7B-Instruct (open source LLM backbone)
  • Native context: 65,536 tokens with YaRN extension
  • Faster inference than 4B/8B variants (~4.9s vs ~6s)
  • Apache-2.0 license
  • Requires H100/H200 or A100 80GB
  • Part of "O" series (OLMo-based)

Quick Reference: Port 8040 | 7B params | ~93GB | OLMo-3 backbone