← Back to all posts

How to Deploy Molmo2-VideoPoint-4B with vLLM

Younes El Hjouji

Quick Deployment

# Setup
python3.12 -m venv venv && source venv/bin/activate
pip install --upgrade pip && pip install vllm

# Launch (full native context)
vllm serve allenai/Molmo2-VideoPoint-4B \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 30000 \
  --max-num-batched-tokens 30000 \
  --gpu-memory-utilization 0.9 \
  --port 8000

# Test
curl http://localhost:8000/v1/models

Video Test (Counting):

VIDEO_BASE64=$(base64 -w 0 /path/to/video.mp4)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"allenai/Molmo2-VideoPoint-4B\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": [
        {\"type\": \"video_url\", \"video_url\": {\"url\": \"data:video/mp4;base64,\$VIDEO_BASE64\"}},
        {\"type\": \"text\", \"text\": \"How many people are in this video?\"}
      ]
    }],
    \"max_tokens\": 100
  }"

Example Output (with spatial coordinates):

Counting the <points coords="0.0 1 490 579 2 900 692">people</points> shows a total of 2.

The model outputs bounding box coordinates for counted objects.


Configuration Options

ConfigGPU %ContextMemoryStatus
Standard60%8192~87GB✅ Tested
Conservative50%4096~73GB est.Untested
High70%16384~100GB est.Untested

Special Features

Video Pointing & Counting:

  • Specialized model for counting objects in videos
  • Outputs spatial coordinates in <points coords="..."> format
  • Coordinates format: frame_id x1 y1 x2 y2 ...
  • Useful for:
    • Counting people/objects in videos
    • Tracking object positions across frames
    • Spatial video analysis

Example Use Cases:

  • "How many cars pass by?"
  • "Count the people in the video"
  • "Where is the ball located?"
  • "Point to the person on the left"

Notes

  • Specialized model - optimized for video pointing/counting only
  • Based on Qwen3-4B-Instruct-2507
  • Native context: 30,000 tokens
  • Outputs include spatial bounding boxes
  • Apache-2.0 license
  • Requires H100/H200 or A100 80GB
  • Not suitable for general video understanding (use Molmo2-4B/8B/O-7B instead)

Quick Reference: Port 8041 | 5B params | ~87GB | Pointing specialist