How to Deploy Molmo2-O-7B with vLLM
Younes El Hjouji
Quick Deployment
# Setup
python3.12 -m venv venv && source venv/bin/activate
pip install --upgrade pip && pip install vllm
# Launch (full native context with YaRN)
vllm serve allenai/Molmo2-O-7B \
--trust-remote-code \
--dtype bfloat16 \
--max-model-len 65536 \
--max-num-batched-tokens 65536 \
--gpu-memory-utilization 0.9 \
--port 8000
# Test
curl http://localhost:8000/v1/models
Video Test:
VIDEO_BASE64=$(base64 -w 0 /path/to/video.mp4)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"allenai/Molmo2-O-7B\",
\"messages\": [{
\"role\": \"user\",
\"content\": [
{\"type\": \"video_url\", \"video_url\": {\"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"}},
{\"type\": \"text\", \"text\": \"Describe this video.\"}
]
}],
\"max_tokens\": 100
}"
Configuration Options
| Config | GPU % | Context | Memory | Status |
|---|---|---|---|---|
| Solo (Recommended) | 90% | 65536 | ~130GB | ✅ Full native context |
| Standard | 65% | 8192 | ~93GB | ✅ Tested |
| Conservative | 55% | 4096 | ~79GB est. | Untested |
Notes
- Based on OLMo-3-7B-Instruct (open source LLM backbone)
- Native context: 65,536 tokens with YaRN extension
- Faster inference than 4B/8B variants (~4.9s vs ~6s)
- Apache-2.0 license
- Requires H100/H200 or A100 80GB
- Part of "O" series (OLMo-based)
Quick Reference: Port 8040 | 7B params | ~93GB | OLMo-3 backbone