Full Sail on Asynchronous Inference

Today all inference is real-time. A human types, a model responds, & the clock starts over. The infrastructure is built for someone waiting on the other end. Every millisecond of latency costs money because the serving stack optimizes for cold-start, not throughput.

As we built internal AI systems at Theory, we embraced queueing. Parallelize ten agents on a single task, let them run for hours, & the productivity gains are enormous. It is the product of token-maxxing,¹ pushing every dollar of compute to do more work. But the cost was unsustainable.

That is when we met Neil Movva & Samir Menon of Sail Research.²

Neil Movva built one of the fastest LLM inference stacks at Together AI. Samir Menon ran LLMs inside hardware enclaves at Blyss. Both are systems engineers to the core. They were building the system we needed.

As the inference market segments into real-time, near-real-time, & batch, async inference sits in the batch tier & carries a massive cost advantage.³ The key is model selection & routing.

Sail distributes requests across open models like DeepSeek, Qwen, Kimi, & GLM, picking the cheapest capable model for each task. GLM-5.1 on Sail costs 6x less per input token than Anthropic’s Haiku 4.5.⁴ Wait two minutes instead of two seconds for a code review, & the same token costs 6x less.
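
To make "cheapest capable model" concrete, here is a minimal sketch of that kind of routing rule. It is not Sail's router. The model names, prices, & capability scores are placeholders invented for illustration, & a single integer score is a simplifying assumption about how capability might be measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_million_tokens: float  # illustrative price, not a real quote
    capability: int                # hypothetical 1-10 score


def cheapest_capable(models: list[Model], required_capability: int) -> Model:
    """Return the lowest-priced model whose capability meets the task's bar."""
    capable = [m for m in models if m.capability >= required_capability]
    if not capable:
        raise ValueError("no model meets the required capability")
    return min(capable, key=lambda m: m.usd_per_million_tokens)


# Placeholder catalogue: every name and number here is made up.
CATALOGUE = [
    Model("small-open-model", 0.10, 4),
    Model("mid-open-model", 0.40, 7),
    Model("large-open-model", 1.20, 9),
]

if __name__ == "__main__":
    # A task that needs capability 6 skips the small model and
    # takes the cheapest of the remaining two.
    print(cheapest_capable(CATALOGUE, required_capability=6).name)
```

The point of the sketch is the ordering: filter on capability first, then minimize price, so an easy task never pays for the largest model.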

Sail uses spot capacity when it is available & fails over to reliable compute when it is not. Fleet-aware orchestration keeps utilization high & cost low.
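
The spot-then-reliable pattern can be sketched in a few lines. This is an assumption about the general shape, not Sail's orchestration code: `SpotUnavailable`, `run_on_spot`, & `run_on_reserved` are hypothetical names standing in for whatever the real fleet exposes.

```python
from typing import Callable, Tuple


class SpotUnavailable(Exception):
    """Raised by a hypothetical spot pool when capacity is missing or reclaimed."""


def run_with_failover(
    job: str,
    run_on_spot: Callable[[str], str],
    run_on_reserved: Callable[[str], str],
) -> Tuple[str, str]:
    """Try cheap spot capacity first; fall back to reserved compute.

    Returns (tier_used, result).
    """
    try:
        return "spot", run_on_spot(job)
    except SpotUnavailable:
        return "reserved", run_on_reserved(job)


# Stand-in backends for the demo.
def healthy_spot(job: str) -> str:
    return f"done:{job}"


def reclaimed_spot(job: str) -> str:
    raise SpotUnavailable(job)


def reserved(job: str) -> str:
    return f"done:{job}"


if __name__ == "__main__":
    print(run_with_failover("review-1", healthy_spot, reserved))
    print(run_with_failover("review-2", reclaimed_spot, reserved))
```

Because the caller is already waiting in a queue, a retry on the reliable tier costs some extra delay but does not break a real-time promise.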

Real-time stacks reserve capacity per request. Queued stacks pack requests into idle capacity. Different architecture, different economics.

Sailboxes are cloud computers for the bursty rhythm of agents. A sailbox stays alive as long as the agent needs, holds state across the entire task, pauses when it waits on inference, & resumes in seconds when the response arrives. You pay for active time. No paying for idle.
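
A rough way to picture "pay for active time" is a meter that only runs between resume & pause. The class below is a hypothetical illustration, not the sailbox API, & the per-second rate is a made-up number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActiveTimeMeter:
    """Accumulates billable seconds only while the box is running."""

    usd_per_active_second: float          # hypothetical rate
    active_seconds: float = 0.0
    _resumed_at: Optional[float] = None   # None means the box is paused

    def resume(self, now: float) -> None:
        if self._resumed_at is None:
            self._resumed_at = now

    def pause(self, now: float) -> None:
        if self._resumed_at is not None:
            self.active_seconds += now - self._resumed_at
            self._resumed_at = None

    def cost(self) -> float:
        return self.active_seconds * self.usd_per_active_second


if __name__ == "__main__":
    meter = ActiveTimeMeter(usd_per_active_second=0.001)
    meter.resume(now=0.0)     # agent starts working
    meter.pause(now=30.0)     # agent submits an inference request and waits
    meter.resume(now=150.0)   # response arrives two minutes later
    meter.pause(now=170.0)    # task finishes
    print(meter.active_seconds, meter.cost())
```

In this timeline the box exists for 170 seconds but the meter only counts the 50 seconds it was running. The 120 seconds spent waiting on the queue are not billed.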

Sail has served trillions of tokens to customers in code review, deep research, & cybersecurity.²

Today we announced our Series A investment in Sail alongside Kleiner Perkins, Redpoint, & Sequoia.

As agents grow from chat assistants into background workers that scan codebases overnight, enrich every CRM row, & process every document, the vast majority of tokens will flow through a queue. The future runs in the background. We are thrilled to partner with Neil, Samir, & the entire Sail team.

If you’re building agents, get started here.

¹ Token-maxxing
² Sail Research
³ Darwinian Specialization in AI
⁴ Sail Research: Cost Efficiency. Input tokens per dollar, comparing Sail’s GLM-5.1 to Anthropic Haiku 4.5 & OpenAI GPT-5.4-mini.