How to Deploy Molmo2-VideoPoint-4B with vLLM
Younes El Hjouji
Quick Deployment
# Setup
python3.12 -m venv venv && source venv/bin/activate
pip install --upgrade pip && pip install vllm
# Launch (full native context)
vllm serve allenai/Molmo2-VideoPoint-4B \
--trust-remote-code \
--dtype bfloat16 \
--max-model-len 30000 \
--max-num-batched-tokens 30000 \
--gpu-memory-utilization 0.9 \
--port 8000
# Test
curl http://localhost:8000/v1/models
Video Test (Counting):
VIDEO_BASE64=$(base64 -w 0 /path/to/video.mp4)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{
\"model\": \"allenai/Molmo2-VideoPoint-4B\",
\"messages\": [{
\"role\": \"user\",
\"content\": [
{\"type\": \"video_url\", \"video_url\": {\"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"}},
{\"type\": \"text\", \"text\": \"How many people are in this video?\"}
]
}],
\"max_tokens\": 100
}"
Example Output (with spatial coordinates):
Counting the <points coords="0.0 1 490 579 2 900 692">people</points> shows a total of 2.
The model outputs bounding box coordinates for counted objects.
Configuration Options
| Config | GPU % | Context | Memory | Status |
|---|---|---|---|---|
| Standard | 60% | 8192 | ~87GB | ✅ Tested |
| Conservative | 50% | 4096 | ~73GB est. | Untested |
| High | 70% | 16384 | ~100GB est. | Untested |
Special Features
Video Pointing & Counting:
- Specialized model for counting objects in videos
- Outputs spatial coordinates in
<points coords="...">format - Coordinates format:
frame_id x1 y1 x2 y2 ... - Useful for:
- Counting people/objects in videos
- Tracking object positions across frames
- Spatial video analysis
Example Use Cases:
- "How many cars pass by?"
- "Count the people in the video"
- "Where is the ball located?"
- "Point to the person on the left"
Notes
- Specialized model - optimized for video pointing/counting only
- Based on Qwen3-4B-Instruct-2507
- Native context: 30,000 tokens
- Outputs include spatial bounding boxes
- Apache-2.0 license
- Requires H100/H200 or A100 80GB
- Not suitable for general video understanding (use Molmo2-4B/8B/O-7B instead)
Quick Reference: Port 8041 | 5B params | ~87GB | Pointing specialist