Blog & deployment guides
Latest posts
Understanding TCP for Server-to-GPU Communication
TCP was built to be reliable everywhere, but for heavy inference payloads it can be a latency bottleneck. A walkthrough of connection reuse, congestion window tuning, and why TCP behaves the way it does, with interactive simulations.
Can VLMs Understand Without Generating?
Vision language models have structural limits that affect visual understanding. We explore whether the strengths of generative models can transfer to VLMs and help close that gap.
Overshoot x CVPR 2026: Papers on Real-Time Vision and Inference
CVPR 2026 accepted ~4,000 papers; ~50 touch on making vision-language models work in real time. A shortlist of seven, across streaming token/KV-cache compression, streaming inference and reasoning, and interaction models, and why each matters for real-time vision.
Reinventing the multimodal processor for 5x lower latency
A 15-frame Qwen video took 428 ms p90 of CPU preprocessing before the request even touched the GPU. We rewrote the processor to produce bit-identical output at 81 ms p90, a 5x reduction across the Qwen3-VL, Qwen3.5, and Qwen3.6 lineup. A walkthrough of where the latency hid, why threading wasn't the fix, and the OMP_NUM_THREADS=1 default that doubled our latency from inside vLLM.
How a small fix improves Gemma 4's performance by 10x on vision tasks
A simple fix in the vLLM video encoder path improves Gemma 4's performance on vision tasks by 10x while preserving bit identity. A walkthrough of continuous batching, the serial vision-tower loop, and the bit-identity trap that blocked batching the obvious way.
Day Zero Gemma 4 Support on Overshoot
Google is back on the open model map with Gemma 4. Four multimodal models, released today, now available on Overshoot for real-time inference.
Survey of Open Source Vision Language Models (2026)
A comprehensive catalog of every significant open-source vision language model released since December 2024. 56+ models across 13 families, with deployment support, context lengths, and download stats.
Qwen 3.5: Architecture, Benchmarks, and Model Selection
A technical breakdown of the Qwen 3.5 family: five natively multimodal models from 2B to 35B. Hybrid attention architecture, benchmark analysis against GPT-5-mini and Claude Sonnet 4.5, and guidance on which model to pick.
Deploying VLMs through vLLM: The Inference Field Guide
Reproducible deployment guides for 30 vision language models across 11 families. Complete setup instructions, deployment commands, troubleshooting, and benchmarks.